Mostly Harmless Econometrics 
An Empiricist’s Companion 




Joshua D. Angrist 
and 
Jörn-Steffen Pischke


PRINCETON UNIVERSITY PRESS · PRINCETON AND OXFORD


Copyright © 2009 by Princeton University Press 


Published by Princeton University Press, 41 William Street, 
Princeton, New Jersey 08540 
In the United Kingdom: Princeton University Press, 
6 Oxford Street, Woodstock, Oxfordshire OX20 1TW 


All Rights Reserved 
Library of Congress Cataloging-in-Publication Data 


Angrist, Joshua David. 
Mostly harmless econometrics : an empiricist’s companion / 
Joshua D. Angrist, Jörn-Steffen Pischke.
p. cm.
Includes bibliographical references and index. 

ISBN 978-0-691-12034-8 (hardcover : alk. paper) — 
ISBN 978-0-691-12035-5 (pbk. : alk. paper) 1. Econometrics. 
2. Regression analysis. I. Pischke, Jörn-Steffen. II. Title.
HB139.A54 2008 
330.01'5195—dc22 2008036265 


British Library Cataloging-in-Publication Data is available 


This book has been composed in Sabon 
with Hel. Neue Cond. family display 


Illustrations by Karen Norberg 
Printed on acid-free paper. ∞
press.princeton.edu 
Printed in the United States of America 


1 3 5 7 9 10 8 6 4 2


CONTENTS 


List of Figures
List of Tables
Preface
Acknowledgments
Organization of This Book


I PRELIMINARIES

1 Questions about Questions

2 The Experimental Ideal
2.1 The Selection Problem
2.2 Random Assignment Solves the Selection Problem
2.3 Regression Analysis of Experiments


II THE CORE

3 Making Regression Make Sense
3.1 Regression Fundamentals
3.2 Regression and Causality
3.3 Heterogeneity and Nonlinearity
3.4 Regression Details
3.5 Appendix: Derivation of the Average Derivative Weighting Function

4 Instrumental Variables in Action: Sometimes You Get What You Need
4.1 IV and Causality
4.2 Asymptotic 2SLS Inference
4.3 Two-Sample IV and Split-Sample IV
4.4 IV with Heterogeneous Potential Outcomes
4.5 Generalizing LATE
4.6 IV Details
4.7 Appendix

5 Parallel Worlds: Fixed Effects, Differences-in-Differences, and Panel Data
5.1 Individual Fixed Effects
5.2 Differences-in-Differences
5.3 Fixed Effects versus Lagged Dependent Variables
5.4 Appendix: More on Fixed Effects and Lagged Dependent Variables


III EXTENSIONS

6 Getting a Little Jumpy: Regression Discontinuity Designs
6.1 Sharp RD
6.2 Fuzzy RD Is IV

7 Quantile Regression
7.1 The Quantile Regression Model
7.2 IV Estimation of Quantile Treatment Effects

8 Nonstandard Standard Error Issues
8.1 The Bias of Robust Standard Error Estimates
8.2 Clustering and Serial Correlation in Panels
8.3 Appendix: Derivation of the Simple Moulton Factor

Last Words
Acronyms and Abbreviations
Empirical Studies Index
References
Index


FIGURES 


3.1.1 Raw data and the CEF of average log weekly wages given schooling
3.1.2 Regression threads the CEF of average weekly wages given schooling
3.1.3 Microdata and grouped data estimates of the returns to schooling
4.1.1 Graphical depiction of the first-stage and reduced form for IV estimates of the economic return to schooling using quarter of birth instruments
4.1.2 The relationship between average earnings and the probability of military service
4.5.1 The effect of compulsory schooling instruments on education
4.6.1 Monte Carlo cumulative distribution functions of OLS, IV, 2SLS, and LIML estimators
4.6.2 Monte Carlo cumulative distribution functions of OLS, 2SLS, and LIML estimators with 20 instruments
4.6.3 Monte Carlo cumulative distribution functions of OLS, 2SLS, and LIML estimators with 20 worthless instruments
5.2.1 Causal effects in the DD model
5.2.2 Employment in New Jersey and Pennsylvania fast food restaurants, October 1991 to September 1997
5.2.3 Average grade repetition rates in second grade for treatment and control schools in Germany
5.2.4 The estimated impact of implied-contract exceptions to the employment-at-will doctrine on the use of temporary workers
6.1.1 The sharp regression discontinuity design
6.1.2 The probability of winning an election by past and future vote share
6.2.1 The fuzzy-RD first-stage for regression-discontinuity estimates of the effect of class size on test scores
7.1.1 The quantile regression approximation property


TABLES


2.2.1 Comparison of treatment and control characteristics in the Tennessee STAR experiment
2.2.2 Experimental estimates of the effect of class size on test scores
3.2.1 Estimates of the returns to education for men in the NLSY
3.3.1 Uncontrolled, matching, and regression estimates of the effects of voluntary military service on earnings
3.3.2 Covariate means in the NSW and observational control samples
3.3.3 Regression estimates of NSW training effects using alternative controls
3.4.1 Average outcomes in two of the HIE treatment groups
3.4.2 Comparison of alternative estimates of the effect of childbearing on LDVs
4.1.1 2SLS estimates of the economic returns to schooling
4.1.2 Wald estimates of the returns to schooling using quarter-of-birth instruments
4.1.3 Wald estimates of the effects of military service on the earnings of white men born in 1950
4.1.4 Wald estimates of the effects of family size on labor supply
4.4.1 Results from the JTPA experiment: OLS and IV estimates of training impacts
4.4.2 Probabilities of compliance in instrumental variables studies
4.4.3 Complier characteristics ratios for twins and sex composition instruments
4.6.1 2SLS, Abadie, and bivariate probit estimates of the effects of a third child on female labor supply

The table numbers for the remaining entries did not survive extraction:

Alternative IV estimates of the economic returns to schooling
Estimated effects of union status on wages
Average employment in fast food restaurants before and after the New Jersey minimum wage increase
Regression DD estimates of minimum wage effects on teens, 1989-1990
Estimated effects of labor regulation on the performance of firms in Indian states
OLS and fuzzy RD estimates of the effect of class size on fifth-grade math scores
Quantile regression coefficients for schooling in the 1970, 1980, and 2000 censuses
Quantile regression estimates and quantile treatment effects from the JTPA experiment
Monte Carlo results for robust standard error estimates
Standard errors for class size effects in the STAR data


PREFACE 


The universe of econometrics is constantly expanding.
Econometric methods and practice have advanced
greatly as a result, but the modern menu of econo- 
metric methods can seem confusing, even to an experienced 
number cruncher. Luckily, not everything on the menu is 
equally valuable or important. Some of the more exotic items 
are needlessly complex and may even be harmful. On the 
plus side, the core methods of applied econometrics remain 
largely unchanged, while the interpretation of basic tools has 
become more nuanced and sophisticated. Our Companion is 
an empiricist’s guide to the econometric essentials ... Mostly 
Harmless Econometrics. 
The most important items in an applied econometrician’s 
toolkit are: 


1. Regression models designed to control for variables that 
may mask the causal effects of interest; 

2. Instrumental variables methods for the analysis of real and 
natural experiments; 

3. Differences-in-differences-type strategies that use repeated 
observations to control for unobserved omitted factors. 


The productive use of these basic techniques requires a solid 
conceptual foundation and a good understanding of the 
machinery of statistical inference. Both aspects of applied 
econometrics are covered here. 

Our view of what’s important has been shaped by our expe- 
rience as empirical researchers, and especially by our work 
teaching and advising economics Ph.D. students. This book 
was written with these students in mind. At the same 
time, we hope the book will find an audience among other 
groups of researchers who have an urgent need for practical 
answers regarding choice of technique and the interpretation
of research findings. The concerns of applied econometrics
are not fundamentally different from those of other social sci- 
ences or epidemiology. Anyone interested in using data to 
shape public policy or to promote public health must digest 
and use statistical results. Anyone interested in drawing useful 
inferences from data on people can be said to be an applied 
econometrician. 

Many textbooks provide a guide to research methods, and 
there is some overlap between this book and others in wide 
use. But our Companion differs from econometrics texts in a 
number of important ways. First, we believe that empirical 
research is most valuable when it uses data to answer spe- 
cific causal questions, as if in a randomized clinical trial. This 
view shapes our approach to most research questions. In the 
absence of a real experiment, we look for well-controlled com- 
parisons and/or natural quasi-experiments. Of course, some 
quasi-experimental research designs are more convincing than 
others, but the econometric methods used in these studies are 
almost always fairly simple. Consequently, our book is shorter 
and more focused than textbook treatments of econometric 
methods. We emphasize the conceptual issues and simple sta- 
tistical techniques that turn up in the applied research we read 
and do, and illustrate these ideas and techniques with many 
empirical examples. 

A second distinction we claim is a certain lack of gravitas. 
Most econometrics texts appear to take econometric models 
very seriously. Typically these books pay a lot of attention 
to the putative failures of classical modeling assumptions, 
such as linearity and homoskedasticity. Warnings are some- 
times issued. We take a more forgiving and less literal-minded 
approach. A principle that guides our discussion is that the 
estimators in common use almost always have a simple inter- 
pretation that is not heavily model dependent. If the estimates 
you get are not the estimates you want, the fault lies in the 
econometrician and not the econometrics! A leading example 
is linear regression, which provides useful information about 
the conditional mean function regardless of the shape of this 
function. Likewise, instrumental variables methods estimate 
an average causal effect for a well-defined population even
if the instrument does not affect everyone. The conceptual
robustness of basic econometric tools is grasped intuitively by 
many applied researchers, but the theory behind this robust- 
ness does not feature in most texts. Our Companion also 
differs from most econometrics texts in that, on the inference 
side, we are not much concerned with asymptotic efficiency. 
Rather, our discussion of inference is devoted mostly to the 
finite-sample bugaboos that should bother practitioners. 

The main prerequisite for understanding the material here 
is basic training in probability and statistics. We especially 
hope that readers are comfortable with the elementary tools 
of statistical inference, such as t-statistics and standard errors.
Familiarity with fundamental probability concepts such as 
mathematical expectation is also helpful, but extraordinary 
mathematical sophistication is not required. Although impor- 
tant proofs are presented, the technical arguments are not very 
long or complicated. Unlike many upper-level econometrics 
texts, we go easy on the linear algebra. For this reason and oth- 
ers, our Companion should be an easier read than competing 
books. Finally, in the spirit of Douglas Adams’s lighthearted 
serial (The Hitchhiker’s Guide to the Galaxy and Mostly 
Harmless, among others) from which we draw continued 
inspiration, our Companion may have occasional inaccura- 
cies, but it is quite a bit cheaper than the many versions 
of the Encyclopedia Galactica Econometrica that dominate 
today’s market. Grateful thanks to Princeton University Press 
for agreeing to distribute our Companion on these terms. 




ACKNOWLEDGMENTS 


We had the benefit of comments from many friends
and colleagues as this project progressed. Thanks are
due to Alberto Abadie, Patrick Arni, David Autor,
Amitabh Chandra, Monica Chen, Victor Chernozhukov, John 
DiNardo, Peter Dolton, Joe Doyle, Jerry Hausman, Andrea 
Ichino, Guido Imbens, Adriana Kugler, Rafael Lalive, Alan 
Manning, Whitney Newey, Derek Neal, Barbara Petrongolo, 
James Robinson, Gary Solon, Tavneet Suri, Jeff Wooldridge, 
and Jean-Philippe Wullrich, who reacted to the manuscript at 
various stages. They are not to blame for our presumptuous- 
ness or remaining mistakes. Thanks also go to our students at 
LSE and MIT, who saw the material first and helped us decide 
what’s important. We would especially like to acknowledge 
the skilled and tireless research assistance of Bruno Ferman, 
Brigham Frandsen, Cynthia Kinnan, and Chris Smith. We’re 
deeply indebted to our infinitely patient illustrator, Karen 
Norberg, who created the images at the beginning of each 
chapter and provided valuable feedback on matters large and 
small. We’re also grateful for the enthusiasm and guidance of 
Tim Sullivan and Seth Ditchik, our editors at Princeton Univer- 
sity Press, and for the careful work of our copy editor, Marjorie 
Pannell, and production editor, Leslie Grundfest. Last, but 
certainly not least, we thank our wives for their love and sup- 
port. They know better than anyone what it means to be an 
empiricist’s companion. 




ORGANIZATION OF THIS BOOK 


We begin with two introductory chapters. The first
describes the type of research agenda for which the
material in subsequent chapters is most likely to be
useful. The second discusses the sense in which random- 
ized trials of the sort used in medical research provide an 
ideal benchmark for the questions we find most interest- 
ing. After this introduction, the three chapters of part II 
present core material on regression, instrumental variables, 
and differences-in-differences. These chapters emphasize both 
the universal properties of estimators (e.g., regression always 
approximates the conditional mean function) and the assump- 
tions necessary for a causal interpretation of results (the 
conditional independence assumption; instruments as good 
as randomly assigned; parallel worlds). We then turn to 
important extensions in part III. Chapter 6 covers regression 
discontinuity designs, which can be seen as either a varia- 
tion on regression-control strategies or a type of instrumental 
variables strategy. In chapter 7, we discuss the use of quan- 
tile regression for estimating effects on distributions. The last 
chapter covers important inference problems that are missed 
by the textbook asymptotic approach. Some chapters include 
more technical or specialized sections that can be skimmed or 
skipped without missing out on the main ideas; these sections 
are indicated with a star. A glossary of acronyms and abbreviations
and an index to empirical examples can be found at
the back of the book.


Part I


Preliminaries 


Chapter 1 


Questions about Questions 




“I checked it very thoroughly,” said the computer, “and that 
quite definitely is the answer. I think the problem, to be quite 
honest with you, is that you’ve never actually known what 
the question is.” 

Douglas Adams, The Hitchhiker’s Guide to the Galaxy 


This chapter briefly discusses the basis for a successful
research project. Like the biblical story of Exodus, a
research agenda can be organized around four questions.
We call these frequently asked questions (FAQs), because they 
should be. The FAQs ask about the relationship of interest, the 
ideal experiment, the identification strategy, and the mode of 
inference. 

In the beginning, we should ask, What is the causal rela- 
tionship of interest? Although purely descriptive research has 
an important role to play, we believe that the most interesting 
research in social science is about questions of cause and effect, 
such as the effect of class size on children’s test scores, dis- 
cussed in chapters 2 and 6. A causal relationship is useful for 
making predictions about the consequences of changing cir- 
cumstances or policies; it tells us what would happen in alter- 
native (or “counterfactual”) worlds. For example, as part of 
a research agenda investigating human productive capacity— 
what labor economists call human capital—we have both 
investigated the causal effect of schooling on wages (Card, 
1999, surveys research in this area). The causal effect of 
schooling on wages is the increment to wages an individual 
would receive if he or she got more schooling. A range of 
studies suggest the causal effect of a college degree is about 40 
percent higher wages on average, quite a payoff. The causal
effect of schooling on wages is useful for predicting the earnings
consequences of, say, changing the costs of attending
college, or strengthening compulsory attendance laws. This 
relation is also of theoretical interest since it can be derived 
from an economic model. 

As labor economists, we’re most likely to study causal 
effects in samples of workers, but the unit of observation in 
causal research need not be an individual human being. Causal 
questions can be asked about firms or, for that matter, coun- 
tries. Take, for example, Acemoglu, Johnson, and Robinson’s 
(2001) research on the effect of colonial institutions on eco- 
nomic growth. This study is concerned with whether countries 
that inherited more democratic institutions from their colonial 
rulers later enjoyed higher economic growth as a consequence. 
The answer to this question has implications for our under- 
standing of history and for the consequences of contemporary 
development policy. Today, we might wonder whether newly 
forming democratic institutions are important for economic 
development in Iraq and Afghanistan. The case for democ- 
racy is far from clear-cut; at the moment, China is enjoying 
robust economic growth without the benefit of complete polit- 
ical freedom, while much of Latin America has democratized 
without a big growth payoff. 

The second research FAQ is concerned with the experi- 
ment that could ideally be used to capture the causal effect 
of interest. In the case of schooling and wages, for example, 
we can imagine offering potential dropouts a reward for fin- 
ishing school, and then studying the consequences. In fact, 
Angrist and Lavy (2008) have run just such an experiment. 
Although their study looked at short-term effects such as col- 
lege enrollment, a longer-term follow-up might well look at 
wages. In the case of political institutions, we might like to 
go back in time and randomly assign different government 
structures in former colonies on their independence day (an 
experiment that is more likely to be made into a movie than 
to get funded by the National Science Foundation). 

Ideal experiments are most often hypothetical. Still, hypo- 
thetical experiments are worth contemplating because they 
help us pick fruitful research topics. We’ll support this claim by
asking you to picture yourself as a researcher with no budget
constraint and no Human Subjects Committee policing your 
inquiry for social correctness: something like a well-funded 
Stanley Milgram, the psychologist who did pathbreaking work 
on the response to authority in the 1960s using highly contro- 
versial experimental designs that would likely cost him his job 
today. 

Seeking to understand the response to authority, Milgram 
(1963) showed he could convince experimental subjects to 
administer painful electric shocks to pitifully protesting victims 
(the shocks were fake and the victims were actors). This turned 
out to be controversial as well as clever: some psychologists 
claimed that the subjects who administered shocks were psy- 
chologically harmed by the experiment. Still, Milgram’s study 
illustrates the point that there are many experiments we can 
think about, even if some are better left on the drawing board.¹
If you can’t devise an experiment that answers your question 
in a world where anything goes, then the odds of generat- 
ing useful results with a modest budget and nonexperimental 
survey data seem pretty slim. The description of an ideal exper- 
iment also helps you formulate causal questions precisely. The 
mechanics of an ideal experiment highlight the forces you’d 
like to manipulate and the factors you’d like to hold constant. 

Research questions that cannot be answered by any exper- 
iment are FUQs: fundamentally unidentified questions. What 
exactly does a FUQ look like? At first blush, questions about 
the causal effect of race or gender seem good candidates 
because these things are hard to manipulate in isolation 
(“imagine your chromosomes were switched at birth”). On 
the other hand, the issue economists care most about in the 
realm of race and sex, labor market discrimination, turns on 
whether someone treats you differently because they believe 
you to be black or white, male or female. The notion of a 
counterfactual world where men are perceived as women or 
vice versa has a long history and does not require Douglas 
Adams-style outlandishness to entertain (Rosalind disguised
as Ganymede fools everyone in Shakespeare’s As You Like
It). The idea of changing race is similarly near-fetched: in The
Human Stain, Philip Roth imagines the world of Coleman
Silk, a black literature professor who passes as white in pro-
fessional life. Labor economists imagine this sort of thing all
the time. Sometimes we even construct such scenarios for the
advancement of science, as in audit studies involving fake job
applicants and résumés.²

¹Milgram was later played by the actor William Shatner in a TV special,
an honor that no economist has yet received, though Angrist is still hopeful.

A little imagination goes a long way when it comes to 
research design, but imagination cannot solve every problem. 
Suppose that we are interested in whether children do bet- 
ter in school by virtue of having started school a little older. 
Maybe the 7-year-old brain is better prepared for learning than 
the 6-year-old brain. This question has a policy angle com- 
ing from the fact that, in an effort to boost test scores, some 
school districts are now imposing older start ages (Deming and 
Dynarski, 2008). To assess the effects of delayed school entry 
on learning, we could randomly select some kids to start first 
grade at age 7, while others start at age 6, as is still typical. 
We are interested in whether those held back learn more in 
school, as evidenced by their elementary school test scores. To 
be concrete, let’s look at test scores in first grade. 

The problem with this question—the effects of start age on 
first grade test scores—is that the group that started school at 
age 7 is... older. And older kids tend to do better on tests, a 
pure maturation effect. Now, it might seem we can fix this by 
holding age constant instead of grade. Suppose we wait to test 
those who started at age 6 until second grade and test those 
who started at age 7 in first grade, so that everybody is tested 
at age 7. But the first group has spent more time in school, a 
fact that raises achievement if school is worth anything. There 
is no way to disentangle the effect of start age on learning 
from maturation and time-in-school effects as long as kids are 
still in school. The problem here is that for students, start age
equals current age minus time in school. This deterministic
link disappears in a sample of adults, so we can investigate
pure start-age effects on adult outcomes, such as earnings or
highest grade completed (as in Black, Devereux, and Salvanes,
2008). But the effect of start age on elementary school test
scores is impossible to interpret even in a randomized trial,
and therefore, in a word, FUQed.

²A recent example is Bertrand and Mullainathan (2004), who compared
employers’ responses to résumés with blacker-sounding and whiter-sounding
first names, such as Lakisha and Emily (though Fryer and Levitt, 2004, note
that names may carry information about socioeconomic status as well as race).
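The identity behind this FUQ (start age equals current age minus time in school) can be checked numerically. The sketch below is only an illustration: the four hypothetical pupils and the variable names are assumptions, not data from any study. It builds the regressor matrix one would need to separate the three effects and shows that it is rank deficient, so no regression on pupils still in school can tell them apart.

```python
import numpy as np

# Hypothetical pupils: every combination of start age (6 or 7)
# and years spent in school so far (1 or 2). Illustrative values only.
start_age = np.array([6, 6, 7, 7])
time_in_school = np.array([1, 2, 1, 2])

# For children still in school, current age is fully determined
# by the other two variables.
current_age = start_age + time_in_school

# Regressors needed to separate start-age, maturation, and
# time-in-school effects: a constant plus all three variables.
X = np.column_stack([np.ones(4), start_age, current_age, time_in_school])

rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```

The matrix has four columns but fewer linearly independent ones, which is the linear-algebra version of the statement that the three effects cannot be disentangled while the identity holds.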

The third and fourth research FAQs are concerned with 
the nuts-and-bolts elements that produce a specific study. 
Question number 3 asks, What is your identification strat- 
egy? Angrist and Krueger (1999) used the term identification 
strategy to describe the manner in which a researcher uses 
observational data (i.e., data not generated by a random- 
ized trial) to approximate a real experiment. Returning to 
the schooling example, Angrist and Krueger (1991) used the 
interaction between compulsory attendance laws in American 
states and students’ season of birth as a natural experiment to 
estimate the causal effects of finishing high school on wages 
(season of birth affects the degree to which high school stu- 
dents are constrained by laws allowing them to drop out after 
their 16th birthday). Chapters 3-6 are primarily concerned 
with conceptual frameworks for identification strategies. 

Although a focus on credible identification strategies is 
emblematic of modern empirical work, the juxtaposition of 
ideal and natural experiments has a long history in economet- 
rics. Here is our econometrics forefather, Trygve Haavelmo 
(1944, p. 14), appealing for more explicit discussion of both 
kinds of experimental designs: 


A design of experiments (a prescription of what the physi- 
cists call a “crucial experiment”) is an essential appendix 
to any quantitative theory. And we usually have some such 
experiment in mind when we construct the theories, although— 
unfortunately—most economists do not describe their design 
of experiments explicitly. If they did, they would see that the 
experiments they have in mind may be grouped into two dif- 
ferent classes, namely, (1) experiments that we should like to 
make to see if certain real economic phenomena—when arti-
ficially isolated from “other influences”—would verify certain
hypotheses, and (2) the stream of experiments that Nature is
steadily turning out from her own enormous laboratory, and 
which we merely watch as passive observers. In both cases 
the aim of the theory is the same, to become master of the 
happenings of real life. 


The fourth research FAQ borrows language from Rubin 
(1991): What is your mode of statistical inference? The answer 
to this question describes the population to be studied, the 
sample to be used, and the assumptions made when construct- 
ing standard errors. Sometimes inference is straightforward, as 
when you use census microdata samples to study the American 
population. Often inference is more complex, however, espe- 
cially with data that are clustered or grouped. The last chapter 
covers practical problems that arise once you’ve answered 
question number 4. Although inference issues are rarely very 
exciting, and often quite technical, the ultimate success of even 
a well-conceived and conceptually exciting project turns on the 
details of statistical inference. This sometimes dispiriting fact 
inspired the following econometrics haiku, penned by Keisuke 
Hirano after completing his thesis: 


T-stat looks too good 
Try clustered standard errors— 
Significance gone 


As should be clear from the above discussion, the four 
research FAQs are part of a process of project development. 
The following chapters are concerned mostly with the econo- 
metric questions that come up after you’ve answered the 
research FAQs—in other words, issues that arise once your 
research agenda has been set. Before turning to the nuts and 
bolts of empirical work, however, we begin with a more 
detailed explanation of why randomized trials give us our 
benchmark.


Chapter 2


The Experimental Ideal 




It is an important and popular fact that things are not always 
what they seem. For instance, on the planet Earth, man had 
always assumed that he was more intelligent than dolphins 
because he had achieved so much—the wheel, New York, 
wars and so on—while all the dolphins had ever done was 
muck about in the water having a good time. But conversely, 
the dolphins had always believed that they were far more 
intelligent than man—for precisely the same reasons. In fact 
there was only one species on the planet more intelligent than 
dolphins, and they spent a lot of their time in behavioral 
research laboratories running round inside wheels and 
conducting frighteningly elegant and subtle experiments on 
man. The fact that once again man completely misinterpreted 
this relationship was entirely according to these creatures’ 
plans. 

Douglas Adams, The Hitchhiker’s Guide to the Galaxy 


The most credible and influential research designs use random
assignment. A case in point is the Perry preschool
project, a 1962 randomized experiment designed to
assess the effects of an early intervention program involv- 
ing 123 black preschoolers in Ypsilanti, Michigan. The Perry 
treatment group was randomly assigned to an intensive inter- 
vention that included preschool education and home visits. It’s 
hard to exaggerate the impact of the small but well-designed 
Perry experiment, which generated follow-up data through 
1993 on the participants at age 27. Dozens of academic stud- 
ies cite or use the Perry findings (see, e.g., Barnett, 1992). Most 
important, the Perry project provided the intellectual basis for 
the massive Head Start preschool program, begun in 1964,
which ultimately served (and continues to serve) millions of
American children.¹


2.1 The Selection Problem 


We take a brief time-out for a more formal discussion of the 
role experiments play in uncovering causal effects. Suppose 
you are interested in a causal if-then question. To be con- 
crete, let us consider a simple example: Do hospitals make 
people healthier? For our purposes, this question is allegori- 
cal, but it is surprisingly close to the sort of causal question 
health economists care about. To make this question more 
realistic, let’s imagine we’re studying a poor elderly population 
that uses hospital emergency rooms for primary care. Some of 
these patients are admitted to the hospital. This sort of care is 
expensive, crowds hospital facilities, and is, perhaps, not very 
effective (see, e.g., Grumbach, Keane, and Bindman, 1993). In 
fact, exposure to other sick patients by those who are them- 
selves vulnerable might have a net negative impact on their 
health. 

Since those admitted to the hospital get many valuable ser- 
vices, the answer to the hospital effectiveness question still 
seems likely to be yes. But will the data back this up? The 
natural approach for an empirically minded person is to com- 
pare the health status of those who have been to the hospital 
with the health of those who have not. The National Health 
Interview Survey (NHIS) contains the information needed to 
make this comparison. Specifically, it includes a question, 
“During the past 12 months, was the respondent a patient 
in a hospital overnight?” which we can use to identify recent 
hospital visitors. The NHIS also asks, “Would you say your 
health in general is excellent, very good, good, fair, poor?” 


¹The Perry data continue to get attention, particularly as policy interest
has returned to early education. A recent reanalysis by Michael Anderson 
(2008) confirmed many of the findings from the original Perry study, though 
Anderson also shows that the overall positive effects of the Perry project are 
driven entirely by the impact on girls. The Perry intervention seems to have 
done nothing for boys.

The following table displays the mean health status (assigning
a 1 to poor health and a 5 to excellent health) among those 
who have been hospitalized and those who have not (tabulated 
from the 2005 NHIS): 


Group         Sample Size   Mean Health Status   Std. Error
Hospital            7,774                 3.21        0.014
No hospital        90,049                 3.93        0.003


The difference in means is 0.72, a large and highly significant 
contrast in favor of the nonhospitalized, with a t-statistic of 
58.9. 

Taken at face value, this result suggests that going to the 
hospital makes people sicker. It’s not impossible this is the
right answer since hospitals are full of other sick people who 
might infect us and dangerous machines and chemicals that 
might hurt us. Still, it’s easy to see why this comparison should 
not be taken at face value: people who go to the hospital are 
probably less healthy to begin with. Moreover, even after hos- 
pitalization people who have sought medical care are not as 
healthy, on average, as those who were never hospitalized in 
the first place, though they may well be better off than they 
otherwise would have been. 

To describe this problem more precisely, we can think about
hospital treatment as described by a binary random variable,
$D_i = \{0, 1\}$. The outcome of interest, a measure of health
status, is denoted by $Y_i$. The question is whether $Y_i$ is
affected by hospital care. To address this question, we assume we
can imagine what might have happened to someone who went to
the hospital if that person had not gone, and vice versa. Hence,
for any individual there are two potential health variables:

\[
\text{Potential outcome} =
\begin{cases}
Y_{1i} & \text{if } D_i = 1 \\
Y_{0i} & \text{if } D_i = 0.
\end{cases}
\]

In other words, $Y_{0i}$ is the health status of an individual had he
not gone to the hospital, irrespective of whether he actually
went, while $Y_{1i}$ is the individual’s health status if he goes. We
would like to know the difference between $Y_{1i}$ and $Y_{0i}$, which
can be said to be the causal effect of going to the hospital for
individual $i$. This is what we would measure if we could go
back in time and change a person’s treatment status.²

The observed outcome, $Y_i$, can be written in terms of
potential outcomes as

\[
Y_i =
\begin{cases}
Y_{1i} & \text{if } D_i = 1 \\
Y_{0i} & \text{if } D_i = 0
\end{cases}
= Y_{0i} + (Y_{1i} - Y_{0i}) D_i. \tag{2.1.1}
\]


This notation is useful because $Y_{1i} - Y_{0i}$ is the causal effect
of hospitalization for an individual. In general, there is likely
to be a distribution of both $Y_{1i}$ and $Y_{0i}$ in the population, so
the treatment effect can be different for different people. But 
because we never see both potential outcomes for any one 
person, we must learn about the effects of hospitalization by 
comparing the average health of those who were and were not 
hospitalized. 

A naive comparison of averages by hospitalization status 
tells us something about potential outcomes, though not nec- 
essarily what we want to know. The comparison of average 
health conditional on hospitalization status is formally linked 
to the average causal effect by the equation: 


\[
\begin{aligned}
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{Observed difference in average health}}
&= \underbrace{E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1]}_{\text{Average treatment effect on the treated}} \\
&\quad + \underbrace{E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]}_{\text{Selection bias}}.
\end{aligned}
\]


The term

\[
E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1] = E[Y_{1i} - Y_{0i} \mid D_i = 1]
\]

is the average causal effect of hospitalization on those who
were hospitalized. This term captures the average difference
between the health of the hospitalized, $E[Y_{1i} \mid D_i = 1]$, and what
would have happened to them had they not been hospitalized,


²The potential outcomes idea is a fundamental building block in modern
research on causal effects. Important references developing this idea are Rubin 
(1974, 1977) and Holland (1986), who refers to a causal framework involving 
potential outcomes as the Rubin causal model.

$E[Y_{0i} \mid D_i = 1]$. The observed difference in health status, however,
adds to this causal effect a term called selection bias.
This term is the difference in average $Y_{0i}$ between those who
were and those who were not hospitalized. Because the sick
are more likely than the healthy to seek treatment, those who
were hospitalized have worse values of $Y_{0i}$, making selection
bias negative in this example. The selection bias may be so
large (in absolute value) that it completely masks a positive
treatment effect. The goal of most empirical economic research
is to overcome selection bias, and therefore to say something
about the causal effect of a variable like $D_i$.³
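
Because the decomposition of the naive comparison into a treatment effect plus selection bias is an accounting identity, it can be checked on simulated data in which both potential outcomes are visible. The following is a minimal sketch, not taken from the text: the standard normal health index, the constant effect of 0.3, and the rule that sicker people are more likely to be hospitalized are all illustrative assumptions.

```python
import random
from statistics import mean


def selection_decomposition(n=20_000, effect=0.3, seed=0):
    """Simulate potential outcomes and split the naive contrast.

    Every modeling choice here is an illustrative assumption:
    Y0 is standard normal, Y1 = Y0 + effect, and hospitalization
    is more likely for people with low Y0 (the sick seek care).
    """
    rng = random.Random(seed)
    y0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y1 = [v + effect for v in y0]
    # Non-random selection: D = 1 when latent health plus noise is low.
    d = [1 if v + rng.gauss(0.0, 1.0) < -0.5 else 0 for v in y0]
    # Observed outcome, as in equation (2.1.1): Y = Y0 + (Y1 - Y0) * D.
    y = [a + (b - a) * t for a, b, t in zip(y0, y1, d)]

    treated = [i for i in range(n) if d[i] == 1]
    control = [i for i in range(n) if d[i] == 0]

    naive = mean(y[i] for i in treated) - mean(y[i] for i in control)
    att = mean(y1[i] - y0[i] for i in treated)
    bias = mean(y0[i] for i in treated) - mean(y0[i] for i in control)
    return naive, att, bias


naive, att, bias = selection_decomposition()
print(f"naive difference     : {naive: .3f}")
print(f"effect on the treated: {att: .3f}")
print(f"selection bias       : {bias: .3f}")
```

In this hypothetical population the treatment helps everyone, yet the naive contrast is negative because the selection bias term is negative and larger in absolute value than the effect.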


2.2 Random Assignment Solves the Selection Problem 


Random assignment of $D_i$ solves the selection problem because
random assignment makes $D_i$ independent of potential outcomes.
To see this, note that

\[
\begin{aligned}
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] &= E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0] \\
&= E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1],
\end{aligned}
\]

where the independence of $Y_{0i}$ and $D_i$ allows us to swap
$E[Y_{0i} \mid D_i = 1]$ for $E[Y_{0i} \mid D_i = 0]$ in the second line. In fact, given
random assignment, this simplifies further to

\[
\begin{aligned}
E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1] &= E[Y_{1i} - Y_{0i} \mid D_i = 1] \\
&= E[Y_{1i} - Y_{0i}].
\end{aligned}
\]


The effect of randomly assigned hospitalization on the hos- 
pitalized is the same as the effect of hospitalization on a 
randomly chosen patient. The main thing, however, is that 
random assignment of $D_i$ eliminates selection bias. This does
not mean that randomized trials are problem-free, but in prin- 
ciple they solve the most important problem that arises in 
empirical research. 
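
The same point can be illustrated by simulation. The sketch below is not from the text; it assumes a standard normal no-treatment outcome, a constant effect of 0.3, and coin-flip assignment. Under those assumptions the gap in no-treatment outcomes between the two groups should be close to zero, and the simple difference in means should be close to the true effect.

```python
import random
from statistics import mean


def randomized_contrast(n=20_000, effect=0.3, seed=1):
    """Difference in means when treatment is assigned by coin flip.

    Illustrative assumptions: Y0 is standard normal and the
    treatment effect is the same constant for everyone.
    """
    rng = random.Random(seed)
    y0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Coin-flip assignment, independent of Y0 by construction.
    d = [rng.randint(0, 1) for _ in range(n)]
    y = [v + effect * t for v, t in zip(y0, d)]

    treated_mean = mean(v for v, t in zip(y, d) if t == 1)
    control_mean = mean(v for v, t in zip(y, d) if t == 0)
    # Selection bias term: the gap in Y0 between the two groups.
    bias = (mean(v for v, t in zip(y0, d) if t == 1)
            - mean(v for v, t in zip(y0, d) if t == 0))
    return treated_mean - control_mean, bias


diff, bias = randomized_contrast()
print(f"difference in means: {diff:.3f}")
print(f"selection bias     : {bias:.3f}")
```

The selection bias term is not exactly zero in any one sample; it is sampling noise that shrinks as the sample grows.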


³This section marks our first use of the conditional expectation operator
(e.g., $E[Y_i \mid D_i = 1]$ and $E[Y_i \mid D_i = 0]$). We use this to denote the population
(or infinitely large sample) average of one random variable with the value 
of another held fixed. A more formal and detailed definition appears in 
Chapter 3.

How relevant is our hospitalization allegory? Experiments
often reveal things that are not what they seem on the basis of 
naive comparisons alone. A recent example from medicine is 
the evaluation of hormone replacement therapy (HRT). This 
is a medical intervention that was recommended for middle- 
aged women to reduce menopause symptoms. Evidence from 
the Nurses Health Study, a large and influential nonexperi- 
mental survey of nurses, showed better health among HRT 
users. In contrast, the results of a recently completed random- 
ized trial showed few benefits of HRT. Worse, the randomized 
trial revealed serious side effects that were not apparent in the 
nonexperimental data (see, e.g., Women’s Health Initiative 
[WHI], Hsia et al., 2006). 

An iconic example from our own field of labor economics is 
the evaluation of government-subsidized training programs. 
These are programs that provide a combination of class- 
room instruction and on-the-job training for groups of dis- 
advantaged workers such as the long-term unemployed, drug 
addicts, and ex-offenders. The idea is to increase employment 
and earnings. Paradoxically, studies based on nonexperimen- 
tal comparisons of participants and nonparticipants often 
show that after training, the trainees earn less than plausible 
comparison groups (see, e.g., Ashenfelter, 1978; Ashenfelter 
and Card, 1985; Lalonde, 1995). Here, too, selection bias is a
natural concern, since subsidized training programs are meant 
to serve men and women with low earnings potential. Not 
surprisingly, therefore, simple comparisons of program par- 
ticipants with nonparticipants often show lower earnings for 
the participants. In contrast, evidence from randomized evaluations
of training programs generates mostly positive effects
(see, e.g., Lalonde, 1986; Orr et al., 1996). 

Randomized trials are not yet as common in social science 
as in medicine, but they are becoming more prevalent. One 
area where the importance of random assignment is growing 
rapidly is education research (Angrist, 2004). The 2002 Edu- 
cation Sciences Reform Act passed by the U.S. Congress man- 
dates the use of rigorous experimental or quasi-experimental 
research designs for all federally funded education studies. We 
can therefore expect to see many more randomized trials in
education research in the years to come. A pioneering randomized
study from the field of education is the Tennessee
STAR experiment, designed to estimate the effects of smaller 
classes in primary school. 

Labor economists and others have a long tradition of try- 
ing to establish causal links between features of the classroom 
environment and children’s learning, an area of investigation 
that we call “education production.” This terminology reflects 
the fact that we think of features of the school environment as 
inputs that cost money, while the output that schools produce 
is student learning. A key question in research on education 
production is which inputs produce the most learning given 
their costs. One of the most expensive inputs is class size, since 
smaller classes can only be achieved by hiring more teachers. It 
is therefore important to know whether the expense of smaller 
classes has a payoff in terms of higher student achievement. 
The STAR experiment was meant to answer this question. 

Many studies of education production using nonexperimen- 
tal data suggest there is little or no link between class size and 
student learning. So perhaps school systems can save money 
by hiring fewer teachers, with no consequent reduction in 
achievement. The observed relation between class size and 
student achievement should not be taken at face value, how- 
ever, since weaker students are often deliberately grouped into 
smaller classes. A randomized trial overcomes this problem by 
ensuring that we are comparing apples to apples, that is, that 
the students assigned to classes of different sizes are otherwise 
comparable. Results from the Tennessee STAR experiment 
point to a strong and lasting payoff to smaller classes (see 
Finn and Achilles, 1990, for the original study, and Krueger, 
1999, for an econometric analysis of the STAR data). 

The STAR experiment was unusually ambitious and influ- 
ential, and therefore worth describing in some detail. It cost 
about $12 million and was implemented for a cohort of kinder- 
gartners in 1985-86. The study ran for four years, until the 
original cohort of kindergartners was in third grade, and 
involved about 11,600 children. The average class size in 
regular Tennessee classes in 1985-86 was about 22.3. The 
experiment assigned students to one of three treatments: small
classes with 13-17 children, regular classes with 22-25 children
and a part-time teacher’s aide (the usual arrangement),
or regular classes with a full-time teacher’s aide. Schools with 
at least three classes in each grade could choose to participate 
in the experiment. 

The first question to ask about a randomized experiment 
is whether the randomization successfully balanced subjects’ 
characteristics across the different treatment groups. To assess 
this, it’s common to compare pretreatment outcomes or other 
covariates across groups. Unfortunately, the STAR data fail 
to include any pretreatment test scores, though it is possi- 
ble to look at characteristics of children such as race and 
age. Table 2.2.1, reproduced from Krueger (1999), compares 
the means of these variables. The student characteristics in the 
table are a free lunch variable, student race, and student age. 
Free lunch status is a good measure of family income, since 
only poor children qualify for a free school lunch. Differences 
in these characteristics across the three class types are small, 
and none is significantly different from zero, as indicated 
by the p-values in the last column. This suggests the random 
assignment worked as intended. 

Table 2.2.1 also presents information on average class size, 
the attrition rate, and test scores, measured here on a per- 
centile scale. The attrition rate (proportion of students lost to 
follow-up) was lower in small kindergarten classrooms. This is 
potentially a problem, at least in principle.⁴ Class sizes are
significantly lower in the assigned-to-be-small classrooms, which
means that the experiment succeeded in creating the desired 
variation. If many of the parents of children assigned to regu- 
lar classes had successfully lobbied teachers and principals to 
get their children assigned to small classes, the gap in class size 
across groups would be much smaller. 

Because randomization eliminates selection bias, the differ- 
ence in outcomes across treatment groups captures the average 


⁴Krueger (1999) devotes considerable attention to the attrition problem.
Differences in attrition rates across groups may result in a sample of stu- 
dents in higher grades that is not randomly distributed across class types. The 
kindergarten results, which were unaffected by attrition, are therefore the 
most reliable. 




TABLE 2.2.1
Comparison of treatment and control characteristics in the Tennessee
STAR experiment

                                            Class Size
                                   ----------------------------   P-value for equality
Variable                           Small   Regular   Regular/Aide   across groups
Free lunch                           .47       .48        .50            .09
White/Asian                          .68       .67        .66            .26
Age in 1985                         5.44      5.43       5.42            .32
Attrition rate                       .49       .52        .53            .02
Class size in kindergarten         15.10     22.40      22.80            .00
Percentile score in kindergarten   54.70     48.90      50.00            .00


Notes: Adapted from Krueger (1999), table I. The table shows means of variables 
by treatment status for the sample of students who entered STAR in kindergarten. 
The P-value in the last column is for the F-test of equality of variable means across 
all three groups. The free lunch variable is the fraction receiving a free lunch. 
The percentile score is the average percentile score on three Stanford Achievement 
Tests. The attrition rate is the proportion lost to follow-up before completing third 
grade. 


causal effect of class size (relative to regular classes with a 
part-time aide). In practice, the difference in means between 
treatment and control groups can be obtained from a regres- 
sion of test scores on dummies for each treatment group, a 
point we expand on below. Regression estimates of treatment- 
control differences for kindergartners, reported in table 2.2.2 
(derived from Krueger, 1999, table V), show a small-class 
effect of about five percentile points (other rows in the table 
show coefficients on control variables in the regressions). The 
effect size is about .2σ, where σ is the standard deviation of
the percentile score in kindergarten. The small-class effect is 
significantly different from zero, while the regular/aide effect 
is small and insignificant. 

The STAR study, an exemplary randomized trial in the 
annals of social science, also highlights the logistical difficulty, 
long duration, and potentially high cost of randomized trials.

TABLE 2.2.2
Experimental estimates of the effect of class size on test scores

Explanatory Variable         (1)      (2)      (3)      (4)
Small class                 4.82     5.37     5.36     5.37
                           (2.19)   (1.26)   (1.21)   (1.19)
Regular/aide class           .12      .29      .53      .31
                           (2.23)   (1.13)   (1.09)   (1.07)
White/Asian                    -        -     8.35     8.44
                                             (1.35)   (1.36)
Girl                           -        -     4.48     4.39
                                              (.63)    (.63)
Free lunch                     -        -   -13.15   -13.07
                                              (.77)    (.77)
White teacher                  -        -        -     -.57
                                                      (2.10)
Teacher experience             -        -        -      .26
                                                       (.10)
Teacher Master’s degree        -        -        -    -0.51
                                                      (1.06)
School fixed effects          No      Yes      Yes      Yes
R²                           .01      .25      .31      .31


Notes: Adapted from Krueger (1999), table V. The dependent variable is the 
Stanford Achievement Test percentile score. Robust standard errors allowing 
for correlated residuals within classes are shown in parentheses. The sample 
size is 5,681. 


In many cases, such trials are impractical.⁵ In other cases,
we would like an answer sooner rather than later. Much of


⁵Randomized trials are never perfect, and STAR is no exception. Pupils
who repeated or skipped a grade left the experiment. Students who entered 
an experimental school one grade later were added to the experiment and 
randomly assigned to one of the classes. One unfortunate aspect of the exper- 
iment is that students in the regular and regular/aide classes were reassigned 
after the kindergarten year, possibly because of protests by the parents with 
children in the regular classrooms. There was also some switching of children 
after the kindergarten year. But Krueger’s (1999) analysis suggests that none 
of these implementation problems affected the main conclusions of the study.

the research we do, therefore, attempts to exploit cheaper
and more readily available sources of variation. We hope to 
find natural or quasi-experiments that mimic a randomized 
trial by changing the variable of interest while other factors 
are kept balanced. Can we always find a convincing natural 
experiment? Of course not. Nevertheless, we take the position 
that a notional randomized trial is our benchmark. Not all 
researchers share this view, but many do. We heard it first from 
our teacher and thesis advisor, Orley Ashenfelter, a pioneer- 
ing proponent of experiments and quasi-experimental research 
designs in social science. Here is Ashenfelter (1991) assessing 
the credibility of the observational studies linking schooling 
and income: 


How convincing is the evidence linking education and income? 
Here is my answer: Pretty convincing. If I had to bet on what 
an ideal experiment would indicate, I bet that it would show 
that better educated workers earn more. 


The quasi-experimental study of class size by Angrist and 
Lavy (1999) illustrates the manner in which nonexperimental 
data can be analyzed in an experimental spirit. The Angrist and 
Lavy study relied on the fact that in Israel, class size is capped 
at 40. Therefore, a child in a fifth grade cohort of 40 students 
ends up in a class of 40 while a child in a fifth grade cohort 
of 41 students ends up in a class only half as large because the 
cohort is split. Since students in cohorts of size 40 and 41 are 
likely to be similar on other dimensions, such as ability and 
family background, we can think of the difference between 
40 and 41 students enrolled as being “as good as randomly 
assigned.” 
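
The cap-driven jump in class size can be made concrete with a small function. This is a sketch under a simplifying assumption that is ours rather than the text's: each cohort is split into the fewest classes that respect the cap, and those classes are of equal size. The function name and the example cohorts are hypothetical.

```python
def predicted_class_size(enrollment, cap=40):
    """Average class size when a cohort is split into the fewest
    equal-sized classes that respect the cap.

    The cap of 40 comes from the text; treating classes as equal
    sized is a simplifying assumption for illustration.
    """
    if enrollment < 1:
        raise ValueError("enrollment must be positive")
    n_classes = (enrollment - 1) // cap + 1
    return enrollment / n_classes


for cohort in (39, 40, 41, 80, 81):
    print(cohort, predicted_class_size(cohort))
```

A cohort of 40 yields one class of 40, while a cohort of 41 yields two classes averaging 20.5, which is the discontinuity the comparison exploits.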

The Angrist-Lavy study compared students in grades with 
enrollments above and below bureaucratic class size cutoffs 
to construct well-controlled estimates of the effects of a sharp 
change in class size without the benefit of a real experiment. 
As in the Tennessee STAR study, the Angrist and Lavy (1999) 
results pointed to a strong link between class size and achieve- 
ment. This was in marked contrast to naive analyses, also 
reported by Angrist and Lavy, based on simple comparisons 
between those enrolled in larger and smaller classes. These 
comparisons showed students in smaller classes doing worse
on standardized tests. The hospital allegory of selection bias
would therefore seem to apply to the class size question as
well.⁶


2.3 Regression Analysis of Experiments 


Regression is a useful tool for the study of causal questions,
including the analysis of data from experiments. Suppose (for
now) that the treatment effect is the same for everyone, say
$Y_{1i} - Y_{0i} = \rho$, a constant. With constant treatment effects, we
can rewrite (2.1.1) in the form

\[
Y_i = \underbrace{\alpha}_{E(Y_{0i})} + \underbrace{\rho}_{(Y_{1i} - Y_{0i})} D_i + \underbrace{\eta_i}_{Y_{0i} - E(Y_{0i})}, \tag{2.3.1}
\]

where $\eta_i$ is the random part of $Y_{0i}$. Evaluating the conditional
expectation of this equation with treatment status switched off
and on gives

\[
\begin{aligned}
E[Y_i \mid D_i = 1] &= \alpha + \rho + E[\eta_i \mid D_i = 1] \\
E[Y_i \mid D_i = 0] &= \alpha + E[\eta_i \mid D_i = 0],
\end{aligned}
\]

so that

\[
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{\rho}_{\text{Treatment effect}} + \underbrace{E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0]}_{\text{Selection bias}}.
\]

Thus, selection bias amounts to correlation between the
regression error term, $\eta_i$, and the regressor, $D_i$. Since

\[
E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0] = E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0],
\]


this correlation reflects the difference in (no-treatment) poten- 
tial outcomes between those who get treated and those who 


⁶The Angrist-Lavy (1999) results turn up again in chapter 6, as an illustration
of the quasi-experimental regression-discontinuity research design.

don’t. In the hospital allegory, those who were treated had
poorer health outcomes in the no-treatment state, while in 
the Angrist and Lavy (1999) study, students in smaller classes 
tended to have intrinsically lower test scores. 
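
The link between a regression on a treatment dummy and a comparison of means can be verified numerically. The sketch below uses simulated data (not STAR data) with made-up parameter values, and a hypothetical helper `ols_slope_and_intercept` that computes the bivariate least squares fit directly. With a single zero-one regressor, the slope should equal the treated-minus-control difference in means and the intercept should equal the control-group mean.

```python
import random
from statistics import mean


def ols_slope_and_intercept(y, d):
    """Bivariate OLS of y on d: slope = Cov(y, d) / Var(d)."""
    y_bar, d_bar = mean(y), mean(d)
    cov = sum((a - y_bar) * (b - d_bar) for a, b in zip(y, d))
    var = sum((b - d_bar) ** 2 for b in d)
    slope = cov / var
    return slope, y_bar - slope * d_bar


# Hypothetical data: the numbers below are simulated, not STAR data.
rng = random.Random(2)
d = [rng.randint(0, 1) for _ in range(2_000)]
y = [50.0 + 5.0 * t + rng.gauss(0.0, 25.0) for t in d]

slope, intercept = ols_slope_and_intercept(y, d)
treated_mean = mean(v for v, t in zip(y, d) if t == 1)
control_mean = mean(v for v, t in zip(y, d) if t == 0)

print(f"OLS slope           : {slope:.4f}")
print(f"difference in means : {treated_mean - control_mean:.4f}")
print(f"OLS intercept       : {intercept:.4f}")
print(f"control-group mean  : {control_mean:.4f}")
```

The equality is algebraic, so it holds whether or not treatment was randomly assigned; random assignment matters only for whether that difference can be read as a causal effect.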

In the STAR experiment, where $D_i$ is randomly assigned,
the selection bias term disappears, and a regression of $Y_i$ on $D_i$
estimates the causal effect of interest, $\rho$. Table 2.2.2 shows
different regression specifications, some of which include
covariates other than the random assignment indicator, $D_i$.
Covariates play two roles in regression analyses of experimen- 
tal data. First, the STAR experimental design used conditional 
random assignment. In particular, assignment to classes of dif- 
ferent sizes was random within schools but not across schools. 
Students attending schools of different types (say, urban versus 
rural) were a bit more or less likely to be assigned to a small 
class. The comparison in column 1 of table 2.2.2, which makes 
no adjustment for this, might therefore be contaminated by dif- 
ferences in achievement in schools of different types. To adjust 
for this, some of Krueger’s regression models include school 
fixed effects, that is, a separate intercept for each school in 
the STAR data. In practice, the consequences of adjusting for
school fixed effects are rather minor, but we wouldn’t know this
without taking a look. We have more to say about regression 
models with fixed effects in chapter 5. 

The other controls in Krueger’s table describe student char- 
acteristics such as race, age, and free lunch status. We saw 
before that these individual characteristics are balanced across 
class types, that is, they are not systematically related to the 
class size assignment of the student. If these controls, call them
$X_i$, are uncorrelated with the treatment $D_i$, then they will not
affect the estimate of $\rho$. In other words, estimates of $\rho$ in the
long regression,

\[
Y_i = \alpha + \rho D_i + X_i'\gamma + \eta_i, \tag{2.3.2}
\]

will be close to estimates of $\rho$ in the short regression, (2.3.1).
This is a point we expand on in chapter 3.
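
One way to see why, as a sketch under the added assumption (ours, for this illustration) that $\eta_i$ in (2.3.2) is uncorrelated with $D_i$, is to substitute the long regression into the population slope from a bivariate regression of $Y_i$ on $D_i$:

\[
\frac{\mathrm{Cov}(Y_i, D_i)}{V(D_i)}
= \rho + \gamma' \, \frac{\mathrm{Cov}(X_i, D_i)}{V(D_i)}.
\]

When the controls are uncorrelated with treatment, $\mathrm{Cov}(X_i, D_i) = 0$ and the second term drops out, so the short and long regressions have the same population coefficient on $D_i$. In any one sample the covariance is only approximately zero, which is why the two estimates are close rather than identical.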

Inclusion of the variables $X_i$, although not necessary in this
case, may generate more precise estimates of the causal effect
of interest. Notice that the standard error of the estimated
treatment effects in column 3 is smaller than the corresponding
standard error in column 2. Although the control variables,
$X_i$, are uncorrelated with $D_i$, they have substantial explanatory
power for $Y_i$. Including these control variables therefore
reduces the residual variance, which in turn lowers the standard
error of the regression estimates. Similarly, the standard
errors of the estimates of $\rho$ are reduced by the inclusion of
school fixed effects because these too explain an important 
part of the variance in student performance. The last column 
adds teacher characteristics. Because teachers were randomly 
assigned to classes, and teacher characteristics have little to 
do with student achievement in these data, both the estimated 
effect of small classes and its standard error are unchanged by 
the addition of teacher variables. 
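
The precision gain from adding controls that are unrelated to treatment can be seen in a small simulation. The sketch below is illustrative only: the true effect of 5, the single covariate, and the error variances are invented, and the standard errors are classical (homoskedastic) ones rather than the clustered standard errors reported in table 2.2.2. The long regression is computed by first partialling the covariate out of both the outcome and the treatment dummy, which gives the same coefficient as a regression that includes both regressors.

```python
import random
from statistics import mean


def residualize(v, x):
    """Residuals from a bivariate OLS regression of v on x."""
    v_bar, x_bar = mean(v), mean(x)
    slope = (sum((a - v_bar) * (b - x_bar) for a, b in zip(v, x))
             / sum((b - x_bar) ** 2 for b in x))
    return [(a - v_bar) - slope * (b - x_bar) for a, b in zip(v, x)]


def slope_and_se(y_res, d_res, n_params):
    """OLS slope of y_res on d_res with a classical standard error."""
    sxx = sum(b * b for b in d_res)
    slope = sum(a * b for a, b in zip(y_res, d_res)) / sxx
    ssr = sum((a - slope * b) ** 2 for a, b in zip(y_res, d_res))
    sigma2 = ssr / (len(y_res) - n_params)
    return slope, (sigma2 / sxx) ** 0.5


def short_vs_long(n=20_000, seed=3):
    """Simulated comparison (assumed values: effect 5, covariate slope 2)."""
    rng = random.Random(seed)
    d = [rng.randint(0, 1) for _ in range(n)]   # coin-flip treatment
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of d
    y = [5.0 * t + 2.0 * c + rng.gauss(0.0, 1.0) for t, c in zip(d, x)]

    # Short regression: y on d and a constant (2 parameters).
    y_bar, d_bar = mean(y), mean(d)
    short = slope_and_se([a - y_bar for a in y],
                         [b - d_bar for b in d], n_params=2)
    # Long regression: y on d, x, and a constant (3 parameters).
    long_ = slope_and_se(residualize(y, x), residualize(d, x), n_params=3)
    return short, long_


(rho_short, se_short), (rho_long, se_long) = short_vs_long()
print(f"short regression: {rho_short:.3f} (se {se_short:.3f})")
print(f"long regression : {rho_long:.3f} (se {se_long:.3f})")
```

Both estimates should land near the assumed effect, while the long-regression standard error is smaller because the covariate absorbs much of the residual variance.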

Regression plays an exceptionally important role in empiri- 
cal economic research. As we’ve seen in this chapter, regression 
is well-suited to the analysis of experimental data. In some 
cases, regression can also be used to approximate experiments 
in the absence of random assignment. But before we get into 
the important question of when a regression is likely to have 
a causal interpretation, it is useful to review a number of 
fundamental regression facts and properties. These facts and 
properties are reliably true for any regression, regardless of the 
motivation for running it. 


Part II 


The Core 


[Illustration caption: “Sir, the fit is good, but the interpretation is outlandish!”]


Chapter 3 


Making Regression Make Sense 




“Let us think the unthinkable, let us do the undoable. 
Let us prepare to grapple with the ineffable itself, 
and see if we may not eff it after all.” 

Douglas Adams, Dirk Gently’s Holistic Detective Agency 


Angrist recounts: 


I ran my first regression in the summer of 1979 between my 
freshman and sophomore years as a student at Oberlin College. 
I was working as a research assistant for Allan Meltzer and 
Scott Richard, faculty members at Carnegie-Mellon University, 
near my house in Pittsburgh. I was still mostly interested in a 
career in special education, and had planned to go back to 
work as an orderly in a state mental hospital, my previous 
summer job. But Econ 101 had got me thinking, and I could 
also see that at the same wage rate, a research assistant’s hours 
and working conditions were better than those of a hospital 
orderly. My research assistant duties included data collection 
and regression analysis, though I did not understand regression 
or even statistics at the time. 

The paper I was working on that summer (Meltzer and 
Richard, 1983) is an attempt to link the size of governments 
in democracies, measured as government expenditure over 
GDP, to income inequality. Most income distributions have 
a long right tail, which means that average income tends to be 
way above the median. When inequality grows, more voters 
find themselves with below-average incomes. Annoyed by this, 
those with incomes between the median and the average may 
join those with incomes below the median in voting for fiscal 
policies that take from the rich and give to the poor. The size 
of government consequently increases.

I absorbed the basic theory behind the Meltzer and Richard
project, though I didn’t find it all that plausible, since voter 
turnout is low for the poor. I also remember arguing with 
my bosses over whether government expenditure on education 
should be classified as a public good (something that bene- 
fits everyone in society as well as those directly affected) or a 
private good publicly supplied, and therefore a form of redis- 
tribution like welfare. You might say this project marked the 
beginning of my interest in the social returns to education, a 
topic I went back to with more enthusiasm and understanding 
in Acemoglu and Angrist (2000). 

Today, I understand the Meltzer and Richard study as an 
attempt to use regression to uncover and quantify an interesting 
causal relation. At the time, however, I was purely a regression 
mechanic. Sometimes I found the RA work depressing. Days 
would go by when I didn’t talk to anybody but my bosses and 
the occasional Carnegie-Mellon Ph.D. student, most of whom 
spoke little English anyway. The best part of the job was lunch 
with Allan Meltzer, a distinguished scholar and a patient and 
good-natured supervisor, who was happy to chat while we ate 
the contents of our brown bags (this did not take long, as Allan 
ate little and I ate fast). Once I asked Allan whether he found it 
satisfying to spend his days perusing regression output, which 
then came on reams of double-wide green-bar paper. Meltzer 
laughed and said there was nothing he would rather be doing. 


Now we too spend our days happily perusing regression 
output, in the manner of our teachers and advisers in college 
and graduate school. This chapter explains why. 


3.1 Regression Fundamentals 


The end of the previous chapter introduced regression models 
as a computational device for the estimation of treatment- 
control differences in an experiment, with and without covari- 
ates. Because the regressor of interest in the class size study 
discussed in section 2.3 was randomly assigned, the resulting
estimates have a causal interpretation. In most studies,
however, regression is used with observational data. Without
the benefit of random assignment, regression estimates may 
or may not have a causal interpretation. We return to the cen- 
tral question of what makes a regression causal later in this 
chapter. 

Setting aside the relatively abstract causality problem for the 
moment, we start with the mechanical properties of regres- 
sion estimates. These are universal features of the population 
regression vector and its sample analog that have nothing to do 
with a researcher’s interpretation of his output. These prop- 
erties include the intimate connection between the population 
regression function and the conditional expectation function 
and the sampling distribution of regression estimates. 


3.1.1 Economic Relationships and 
the Conditional Expectation Function 


Empirical economic research in our field of labor economics is 
typically concerned with the statistical analysis of individual 
economic circumstances, and especially differences between 
people that might account for differences in their economic 
fortunes. Differences in economic fortune are notoriously hard 
to explain; they are, in a word, random. As applied econome- 
tricians, however, we believe we can summarize and interpret 
randomness in a useful way. An example of “systematic ran- 
domness” mentioned in the introduction is the connection 
between education and earnings. On average, people with 
more schooling earn more than people with less schooling. 
The connection between schooling and earnings has consid- 
erable predictive power, in spite of the enormous variation 
in individual circumstances that sometimes clouds this fact. 
Of course, the fact that more educated people tend to earn 
more than less educated people does not mean that school- 
ing causes earnings to increase. The question of whether 
the earnings-schooling relationship is causal is of enormous 
importance, and we come back to it many times. Even with- 
out resolving the difficult question of causality, however, it’s 
clear that education predicts earnings in a narrow statistical 




sense. This predictive power is compellingly summarized by 
the conditional expectation function (CEF). 

The CEF for a dependent variable $y_i$, given a $K \times 1$ vector
of covariates $X_i$ (with elements $x_{ki}$), is the expectation, or
population average, of $y_i$, with $X_i$ held fixed. The population
average can be thought of as the mean in an infinitely large
sample, or the average in a completely enumerated finite
population. The CEF is written $E[y_i|X_i]$ and is a function of $X_i$.
Because $X_i$ is random, the CEF is random, though sometimes
we work with a particular value of the CEF, say $E[y_i|X_i = 42]$,
assuming 42 is a possible value for $X_i$. In chapter 2, we briefly
considered the CEF $E[y_i|D_i]$, where $D_i$ is a zero-one variable.
This CEF takes on two values, $E[y_i|D_i = 1]$ and $E[y_i|D_i = 0]$.
Although this special case is important, we are most often
interested in CEFs that are functions of many variables,
conveniently subsumed in the vector $X_i$. For a specific value of
$X_i$, say $X_i = x$, we write $E[y_i|X_i = x]$. For continuous $y_i$ with
conditional density $f_y(t|X_i = x)$ at $y_i = t$, the CEF is


$$E[y_i|X_i = x] = \int t f_y(t|X_i = x)\,dt.$$


If $y_i$ is discrete, $E[y_i|X_i = x]$ equals the sum
$\sum_t t P(y_i = t|X_i = x)$, where $P(y_i = t|X_i = x)$ is the
conditional probability mass function for $y_i$ given $X_i = x$.

Expectation is a population concept. In practice, data usu- 
ally come in the form of samples and rarely consist of an 
entire population. We therefore use samples to make infer- 
ences about the population. For example, the sample CEF 
is used to learn about the population CEF. This is necessary 
and important, but we postpone a discussion of the formal 
inference step taking us from sample to population until sec- 
tion 3.1.3. Our “population-first” approach to econometrics 
is motivated by the fact that we must define the objects of 
interest before we can use data to study them.1


1Examples of pedagogical writing using the “population-first” approach to
econometrics include Chamberlain (1984), Goldberger (1991), and Manski 
(1991). 

Figure 3.1.1 Raw data and the CEF of average log weekly wages
given schooling. The sample includes white men aged 40-49 in the 
1980 IPUMS 5 percent file. 


Figure 3.1.1 plots the CEF of log weekly wages given school- 
ing for a sample of middle-aged white men from the 1980 
census. The distribution of earnings is also plotted for a few 
key values: 4, 8, 12, and 16 years of schooling. The CEF in 
the figure captures the fact that, notwithstanding the enormous
variation in individual circumstances, people with more
schooling generally earn more. The average earnings gain associated
with a year of schooling is typically about 10 percent. 

An important complement to the CEF is the law of iterated 
expectations. This law says that an unconditional expectation 
can be written as the unconditional average of the CEF. In 
other words, 


$$E[y_i] = E\{E[y_i|X_i]\}, \tag{3.1.1}$$


where the outer expectation uses the distribution of $X_i$. Here
is a proof of the law of iterated expectations for continuously
distributed $(X_i, y_i)$ with joint density $f_{xy}(u, t)$, where
$f_y(t|X_i = u)$ is the conditional distribution of $y_i$ given
$X_i = u$ and $g_y(t)$ and $g_x(u)$ are the marginal densities:


$$
\begin{aligned}
E\{E[y_i|X_i]\} &= \int E[y_i|X_i = u]\, g_x(u)\,du \\
&= \int \left[\int t f_y(t|X_i = u)\,dt\right] g_x(u)\,du \\
&= \int\!\!\int t f_y(t|X_i = u)\, g_x(u)\,du\,dt \\
&= \int t \left[\int f_y(t|X_i = u)\, g_x(u)\,du\right] dt \\
&= \int t \left[\int f_{xy}(u, t)\,du\right] dt \\
&= \int t\, g_y(t)\,dt = E[y_i].
\end{aligned}
$$


The integrals in this derivation run over the possible values
of $X_i$ and $y_i$ (indexed by $u$ and $t$). We’ve laid out these steps
because the CEF and its properties are central to the rest of
this chapter.2
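
The discrete analog of this derivation is easy to check numerically. The sketch below is a minimal illustration on simulated data (the data-generating process is an assumption made for the example): it averages the cell means of `y` using the empirical distribution of `x` as weights and compares the result with the overall mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: x is a discrete covariate, y depends on x plus noise.
n = 5_000
x = rng.integers(0, 3, size=n)            # x takes the values 0, 1, 2
y = 1.0 + 2.0 * x + rng.normal(size=n)

# Inner step: the sample CEF evaluated at each value u of x.
cells, counts = np.unique(x, return_counts=True)
cef_at_u = np.array([y[x == u].mean() for u in cells])

# Outer step: average the CEF using the distribution of x as weights.
g_x = counts / n
iterated_mean = float(np.sum(cef_at_u * g_x))
overall_mean = float(y.mean())

print(overall_mean, iterated_mean)
```

In a sample the two numbers agree up to floating-point error, because the overall mean is by construction a share-weighted average of the cell means.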

The power of the law of iterated expectations comes from 
the way it breaks a random variable into two pieces, the CEF 
and a residual with special properties. 


Theorem 3.1.1 The CEF Decomposition Property.

$$y_i = E[y_i|X_i] + \varepsilon_i,$$


where (i) $\varepsilon_i$ is mean independent of $X_i$, that is,
$E[\varepsilon_i|X_i] = 0$, and therefore (ii) $\varepsilon_i$ is
uncorrelated with any function of $X_i$.


Proof. (i) $E[\varepsilon_i|X_i] = E[y_i - E[y_i|X_i]\,|\,X_i] = E[y_i|X_i] - E[y_i|X_i] = 0$.
(ii) Let $h(X_i)$ be any function of $X_i$. By the law of
iterated expectations, $E[h(X_i)\varepsilon_i] = E\{h(X_i)E[\varepsilon_i|X_i]\}$, and by
mean independence, $E[\varepsilon_i|X_i] = 0$.
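
Part (ii) of the decomposition property can be checked in a sample. The following sketch assumes a simulated data set with a deliberately nonlinear CEF; the functions `h` tried here are arbitrary examples. It forms the CEF residual from cell means and computes the sample analog of $E[h(X_i)\varepsilon_i]$ for several choices of $h$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with a nonlinear CEF: E[y | x] = log(x).
n = 20_000
x = rng.integers(1, 6, size=n)                  # x takes the values 1..5
y = np.log(x) + rng.normal(0.0, 0.5, size=n)

# Sample CEF evaluated at each observation's x (cell means).
cells, inverse = np.unique(x, return_inverse=True)
cell_mean = np.bincount(inverse, weights=y) / np.bincount(inverse)
eps = y - cell_mean[inverse]                    # CEF residual

# Sample analog of E[h(X) * eps] for several functions h.
moments = {
    "h(x) = x": float(np.mean(x * eps)),
    "h(x) = x^2": float(np.mean(x**2 * eps)),
    "h(x) = exp(x)": float(np.mean(np.exp(x) * eps)),
}
for name, value in moments.items():
    print(f"{name}: {value:.2e}")
```

Each moment is zero up to rounding error, since the residual averages to zero within every cell of `x` and any function of `x` is constant within a cell.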


2A simple example illustrates how the law of iterated expectations works: 
Average earnings in a population of men and women is the average for men 
times the proportion male in the population plus the average for women times 
the proportion female in the population. 

This theorem says that any random variable $y_i$ can be
decomposed into a piece that is “explained by $X_i$” (that is,
the CEF) and a piece left over that is orthogonal to (i.e.,
uncorrelated with) any function of $X_i$.

The CEF is a good summary of the relationship between $y_i$
and $X_i$ for a number of reasons. First, we are used to thinking
of averages as providing a representative value for a random
variable. More formally, the CEF is the best predictor of $y_i$
given $X_i$ in the sense that it solves a minimum mean squared
error (MMSE) prediction problem. This CEF prediction
property is a consequence of the CEF decomposition property:


Theorem 3.1.2 The CEF Prediction Property.
Let $m(X_i)$ be any function of $X_i$. The CEF solves


$$E[y_i|X_i] = \arg\min_{m(X_i)} E[(y_i - m(X_i))^2],$$


so it is the MMSE predictor of $y_i$ given $X_i$.


Proof. Write

$$
\begin{aligned}
(y_i - m(X_i))^2 &= ((y_i - E[y_i|X_i]) + (E[y_i|X_i] - m(X_i)))^2 \\
&= (y_i - E[y_i|X_i])^2 + 2(E[y_i|X_i] - m(X_i))(y_i - E[y_i|X_i]) \\
&\quad + (E[y_i|X_i] - m(X_i))^2.
\end{aligned}
$$

The first term doesn’t matter because it doesn’t involve $m(X_i)$.
The second term can be written $h(X_i)\varepsilon_i$, where
$h(X_i) = 2(E[y_i|X_i] - m(X_i))$, and therefore has expectation zero by the
CEF decomposition property. The last term is minimized at
zero when $m(X_i)$ is the CEF.
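
A small numerical comparison may help fix ideas. The sketch below, on simulated data with an assumed quadratic CEF, compares the mean squared error of the sample CEF with that of two other functions of `x`: a least squares line and the CEF shifted by a constant. The competitors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data with a deliberately nonlinear CEF.
n = 20_000
x = rng.integers(0, 5, size=n)
y = (x - 2.0) ** 2 + rng.normal(size=n)

# Sample CEF evaluated at each observation (cell means).
cells, inverse = np.unique(x, return_inverse=True)
cef = (np.bincount(inverse, weights=y) / np.bincount(inverse))[inverse]

# Two competing predictors m(x): a least squares line and a shifted CEF.
slope, intercept = np.polyfit(x, y, deg=1)
linear_fit = intercept + slope * x
shifted_cef = cef + 0.1


def mse(prediction):
    """Sample mean squared prediction error."""
    return float(np.mean((y - prediction) ** 2))


mse_cef = mse(cef)
mse_linear = mse(linear_fit)
mse_shifted = mse(shifted_cef)
print(mse_cef, mse_linear, mse_shifted)
```

The shifted predictor loses exactly the squared shift (0.01 here), which mirrors the last term in the proof: any departure of $m(X_i)$ from the CEF adds its own squared size to the error.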


A final property of the CEF, closely related to both the 
decomposition and prediction properties, is the analysis of 
variance (ANOVA) theorem: 


Theorem 3.1.3 The ANOVA Theorem.

$$V(y_i) = V(E[y_i|X_i]) + E[V(y_i|X_i)],$$


where $V(\cdot)$ denotes variance and $V(y_i|X_i)$ is the conditional
variance of $y_i$ given $X_i$.


Proof. The CEF decomposition property implies the variance
of $y_i$ is the variance of the CEF plus the variance of the residual,
$\varepsilon_i = y_i - E[y_i|X_i]$, since $\varepsilon_i$ and $E[y_i|X_i]$ are
uncorrelated. The variance of $\varepsilon_i$ is


$$E[\varepsilon_i^2] = E[E[\varepsilon_i^2|X_i]] = E[V[y_i|X_i]],$$


where $E[\varepsilon_i^2|X_i] = V[y_i|X_i]$ because $\varepsilon_i = y_i - E[y_i|X_i]$.
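
The variance decomposition also holds exactly in a sample when variances are computed with divisor $N$. The sketch below uses simulated data in which both the CEF and the conditional spread depend on `x`; the data-generating process is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: both the mean and the spread of y depend on x.
n = 10_000
x = rng.integers(0, 4, size=n)
y = np.sqrt(x) + (1.0 + x) * rng.normal(size=n)

cells, inverse = np.unique(x, return_inverse=True)
counts = np.bincount(inverse)
cell_mean = np.bincount(inverse, weights=y) / counts
cell_var = np.bincount(inverse, weights=(y - cell_mean[inverse]) ** 2) / counts

var_y = float(np.var(y))                                  # V(y)
var_of_cef = float(np.var(cell_mean[inverse]))            # V(E[y|X])
mean_of_cond_var = float(np.sum(cell_var * counts / n))   # E[V(y|X)]

print(var_y, var_of_cef + mean_of_cond_var)
```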


The two CEF properties and the ANOVA theorem may 
have a familiar ring. You might be used to seeing an ANOVA 
table in your regression output, for example. ANOVA is also 
important in research on inequality, where labor economists 
decompose changes in the income distribution into parts that 
can be accounted for by changes in worker characteristics and 
changes in what’s left over after accounting for these factors 
(see, e.g., Autor, Katz, and Kearney, 2005). What may be 
unfamiliar is the fact that the CEF properties and ANOVA 
variance decomposition work in the population as well as in 
samples, and do not turn on the assumption of a linear CEF. 
In fact, the validity of linear regression as an empirical tool 
does not turn on linearity either. 


3.1.2 Linear Regression and the CEF 


So what’s the regression you want to run? In our world, this 
question or one like it is heard almost every day. Regression 
estimates provide a valuable baseline for almost all empirical 
research because regression is tightly linked to the CEF, and the 
CEF provides a natural summary of empirical relationships. 
The link between regression functions—that is, the best-fitting 
line generated by minimizing expected squared errors—and 
the CEF can be explained in at least three ways. To lay out 
these explanations precisely, it helps to be precise about the 
regression function we have in mind. This section is con- 
cerned with the vector of population regression coefficients, 
defined as the solution to a population least squares problem. 
At this point we are not worried about causality. Rather,
we let the $K \times 1$ regression coefficient vector $\beta$ be defined by
solving


$$\beta = \arg\min_b E[(y_i - X_i'b)^2]. \tag{3.1.2}$$


Using the first-order condition,

$$E[X_i(y_i - X_i'b)] = 0,$$


the solution can be written $\beta = E[X_i X_i']^{-1}E[X_i y_i]$. Note that
by construction, $E[X_i(y_i - X_i'\beta)] = 0$. In other words, the
population residual, which we define as $y_i - X_i'\beta = e_i$, is
uncorrelated with the regressors, $X_i$. It bears emphasizing that
this error term does not have a life of its own. It owes its
existence and meaning to $\beta$. We return to this important point in
the discussion of causal regression in section 3.2.
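
To make the moment formula concrete, the sketch below treats a large simulated sample as a stand-in for the population (an assumption for illustration only) and computes $\beta$ from the sample analogs of $E[X_i X_i']$ and $E[X_i y_i]$. The CEF is nonlinear on purpose, to show that the residual is still orthogonal to the regressors by construction.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data with a nonlinear CEF: E[y | x] = exp(x / 2).
n = 50_000
x = rng.normal(size=n)
y = np.exp(0.5 * x) + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # rows are X_i' = (1, x_i)

# Sample analogs of E[X_i X_i'] and E[X_i y_i].
Exx = X.T @ X / n
Exy = X.T @ y / n
beta = np.linalg.solve(Exx, Exy)

# The residual is orthogonal to the regressors by construction.
residual = y - X @ beta
orthogonality = X.T @ residual / n       # analog of E[X_i e_i]

# Cross-check against a least squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta, orthogonality)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse, which is a numerical convenience and not part of the definition.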

In the simple bivariate case where the regression vector
includes only the single regressor, $x_i$, and a constant, the
slope coefficient is $\beta_1 = \mathrm{Cov}(y_i, x_i)/V(x_i)$, and the
intercept is $\alpha = E[y_i] - \beta_1 E[x_i]$. In the multivariate case,
with more than one nonconstant regressor, the slope coefficient
for the kth regressor is given below:


REGRESSION ANATOMY


$$\beta_k = \frac{\mathrm{Cov}(y_i, \tilde{x}_{ki})}{V(\tilde{x}_{ki})}, \tag{3.1.3}$$


where $\tilde{x}_{ki}$ is the residual from a regression of $x_{ki}$ on all the
other covariates.

In other words, $E[X_i X_i']^{-1}E[X_i y_i]$ is the $K \times 1$ vector with
kth element $\mathrm{Cov}(y_i, \tilde{x}_{ki})/V(\tilde{x}_{ki})$. This important formula is said to
describe the anatomy of a multivariate regression coefficient
because it reveals much more than the matrix formula
$\beta = E[X_i X_i']^{-1}E[X_i y_i]$. It shows us that each coefficient in a
multivariate regression is the bivariate slope coefficient for the
corresponding regressor after partialing out all the other covariates.

To verify the regression anatomy formula, substitute

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \cdots + \beta_K x_{Ki} + e_i$$


in the numerator of (3.1.3). Since $\tilde{x}_{ki}$ is a linear combination
of the regressors, it is uncorrelated with $e_i$. Also, since $\tilde{x}_{ki}$ is
a residual from a regression on all the other covariates in the
model, it must be uncorrelated with these covariates. Finally,
for the same reason, the covariance of $\tilde{x}_{ki}$ with $x_{ki}$ is just the
variance of $\tilde{x}_{ki}$. We therefore have
$\mathrm{Cov}(y_i, \tilde{x}_{ki}) = \beta_k V(\tilde{x}_{ki})$.3
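
The regression anatomy formula can be verified on simulated data. In the sketch below, the data-generating process and the helper `ols` are hypothetical; the code compares the coefficient on `x1` from a multivariate regression with the bivariate slope of `y` on the residualized regressor. It also computes the slope obtained when the other covariates are partialed out of `y` only, which is rescaled by $V(\tilde{x}_{ki})/V(x_{ki})$.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: x1 is correlated with x2.
n = 5_000
x2 = rng.normal(size=n)
x1 = 0.7 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)


def ols(X, y):
    """Least squares coefficients for a regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)
beta_full = ols(np.column_stack([const, x1, x2]), y)

# Partial the other covariates (constant and x2) out of x1.
others = np.column_stack([const, x2])
x1_tilde = x1 - others @ ols(others, x1)

# Regression anatomy: Cov(y, x1_tilde) / V(x1_tilde).
beta_1_anatomy = float(np.cov(y, x1_tilde, ddof=0)[0, 1] / np.var(x1_tilde))

# Partialing the other covariates out of y only does not recover beta_1.
y_tilde = y - others @ ols(others, y)
slope_y_only = float(np.cov(y_tilde, x1, ddof=0)[0, 1] / np.var(x1))

print(beta_full[1], beta_1_anatomy, slope_y_only)
```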

The regression anatomy formula is probably familiar to you 
from a regression or statistics course, perhaps with one twist: 
the regression coefficients defined in this section are not esti- 
mators; rather, they are nonstochastic features of the joint 
distribution of dependent and independent variables. This 
joint distribution is what you would observe if you had a 
complete enumeration of the population of interest (or knew 
the stochastic process generating the data). You probably 
don’t have such information. Still, it’s good empirical prac- 
tice to think about what population parameters mean before 
worrying about how to estimate them. 

Below we discuss three reasons why the vector of popula- 
tion regression coefficients might be of interest. These reasons 
can be summarized by saying that you should be interested in 
regression parameters if you are interested in the CEF. 


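3The regression anatomy formula is usually attributed to Frisch and Waugh
(1933). You can also do regression anatomy this way:

$$\beta_k = \frac{\mathrm{Cov}(\tilde{y}_{ki}, \tilde{x}_{ki})}{V(\tilde{x}_{ki})},$$

where $\tilde{y}_{ki}$ is the residual from a regression of $y_i$ on every covariate except $x_{ki}$.
This works because the fitted values removed from $\tilde{y}_{ki}$ are uncorrelated with
$\tilde{x}_{ki}$. Often it’s useful to plot $\tilde{y}_{ki}$ against $\tilde{x}_{ki}$; the slope of the least squares fit in
this scatterplot is the multivariate $\beta_k$, even though the plot is two-dimensional.
Note, however, that it’s not enough to partial the other covariates out of $y_i$
only. That is,

$$\frac{\mathrm{Cov}(\tilde{y}_{ki}, x_{ki})}{V(x_{ki})} = \left[\frac{\mathrm{Cov}(\tilde{y}_{ki}, \tilde{x}_{ki})}{V(\tilde{x}_{ki})}\right]\left[\frac{V(\tilde{x}_{ki})}{V(x_{ki})}\right] \neq \beta_k,$$

unless $x_{ki}$ is uncorrelated with the other covariates.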

Theorem 3.1.4 The Linear CEF Theorem (Regression Justification I).

Suppose the CEF is linear. Then the population regression
function is it.


Proof. Suppose $E[y_i|X_i] = X_i'\beta^*$ for a $K \times 1$ vector of
coefficients, $\beta^*$. Recall that $E[X_i(y_i - E[y_i|X_i])] = 0$ by the CEF
decomposition property. Substitute using $E[y_i|X_i] = X_i'\beta^*$ to
find that $\beta^* = E[X_i X_i']^{-1}E[X_i y_i] = \beta$.


The linear CEF theorem raises the question of what makes 
a CEF linear. The classic scenario is joint normality, that is,
the vector $(y_i, X_i')'$ has a multivariate normal distribution. This
is the scenario considered by Galton (1886), father of regres- 
sion, who was interested in the intergenerational link between 
normally distributed traits such as height and intelligence. 
The normal case is clearly of limited empirical relevance since 
regressors and dependent variables are often discrete, while 
normal distributions are continuous. Another linearity sce- 
nario arises when regression models are saturated. As reviewed 
in section 3.1.4, a saturated regression model has a separate 
parameter for every possible combination of values that the set 
of regressors can take on. For example, a saturated regression
model with two dummy covariates includes both covariates 
(with coefficients known as the main effects) and their prod- 
uct (known as an interaction term). Such models are inherently 
linear, a point we also discuss in section 3.1.4. 

The following two reasons for focusing on regression are 
relevant when the linear CEF theorem does not apply. 


Theorem 3.1.5 The Best Linear Predictor Theorem (Regression
Justification II).

The function $X_i'\beta$ is the best linear predictor of $y_i$ given $X_i$
in a MMSE sense.


Proof. $\beta = E[X_i X_i']^{-1}E[X_i y_i]$ solves the population least
squares problem, (3.1.2).


In other words, just as the CEF, $E[y_i|X_i]$, is the best (i.e.,
MMSE) predictor of $y_i$ given $X_i$ in the class of all functions of
$X_i$, the population regression function is the best we can do in
the class of linear functions.


Theorem 3.1.6 The Regression CEF Theorem (Regression
Justification III).

The function $X_i'\beta$ provides the MMSE linear approximation
to $E[y_i|X_i]$, that is,


$$\beta = \arg\min_b E\{(E[y_i|X_i] - X_i'b)^2\}. \tag{3.1.4}$$


Proof. Start by observing that $\beta$ solves (3.1.2). Write

$$
\begin{aligned}
(y_i - X_i'b)^2 &= \{(y_i - E[y_i|X_i]) + (E[y_i|X_i] - X_i'b)\}^2 \\
&= (y_i - E[y_i|X_i])^2 + (E[y_i|X_i] - X_i'b)^2 \\
&\quad + 2(y_i - E[y_i|X_i])(E[y_i|X_i] - X_i'b).
\end{aligned}
$$


The first term doesn’t involve $b$ and the last term has
expectation zero by the CEF decomposition property (ii). The CEF
approximation problem, (3.1.4), is therefore the same as the
population least squares problem, (3.1.2).


These two theorems give us two more ways to view regres- 
sion. Regression provides the best linear predictor for the 
dependent variable in the same way that the CEF is the best 
unrestricted predictor of the dependent variable. On the other 
hand, if we prefer to think about approximating $E[y_i|X_i]$, as
opposed to predicting $y_i$, the regression CEF theorem tells us
that even if the CEF is nonlinear, regression provides the best 
linear approximation to it. 

The regression CEF theorem is our favorite way to motivate 
regression. The statement that regression approximates the 
CEF lines up with our view of empirical work as an effort to 
describe the essential features of statistical relationships with- 
out necessarily trying to pin them down exactly. The linear 
CEF theorem is for special cases only. The best linear pre- 
dictor theorem is satisfyingly general, but seems to encourage 
an overly clinical view of empirical research. We’re not really
interested in predicting individual $y_i$; it’s the distribution of $y_i$
that we care about.

Figure 3.1.2 Regression threads the CEF of average weekly wages
given schooling (dots = CEF; dashes = regression line). 


Figure 3.1.2 illustrates the CEF approximation property for 
the same schooling CEF plotted in figure 3.1.1. The regression 
line fits the somewhat bumpy and nonlinear CEF as if we were 
estimating a model for $E[y_i|X_i]$ instead of a model for $y_i$. In
fact, that is exactly what’s going on. An implication of the
regression CEF theorem is that regression coefficients can be
obtained by using $E[y_i|X_i]$ as a dependent variable instead of
$y_i$ itself. To see this, suppose that $X_i$ is a discrete random
variable with probability mass function $g_x(u)$. Then


$$E\{(E[y_i|X_i] - X_i'b)^2\} = \sum_u (E[y_i|X_i = u] - u'b)^2 g_x(u).$$


This means that $\beta$ can be constructed from the weighted least
squares (WLS) regression of $E[y_i|X_i = u]$ on $u$, where $u$ runs
over the values taken on by $X_i$. The weights are given by the
distribution of $X_i$, that is, $g_x(u)$. An even simpler way to see
this is to iterate expectations in the formula for $\beta$:


$$\beta = E[X_i X_i']^{-1}E[X_i y_i] = E[X_i X_i']^{-1}E[X_i E(y_i|X_i)]. \tag{3.1.5}$$

The CEF or grouped data version of the regression formula
is of practical use when working on a project that precludes 
the analysis of microdata. For example, Angrist (1998) used 
grouped data to study the effect of voluntary military service 
on earnings later in life. One of the estimation strategies used in 
this project regresses civilian earnings on a dummy for veteran 
status, along with personal characteristics and the variables 
used by the military to screen soldiers. The earnings data 
come from the U.S. Social Security system, but Social Secu- 
rity earnings records cannot be released to the public. Instead 
of individual earnings, Angrist worked with average earnings 
conditional on race, sex, test scores, education, and veteran 
status. 

To illustrate the grouped data approach to regression, we 
estimated the schooling coefficient in a wage equation using 21 
conditional means, the sample CEF of earnings given school- 
ing. As the Stata output reproduced in Figure 3.1.3 shows, a 
grouped data regression, weighted by the number of individu- 
als at each schooling level in the sample, produces coefficients 
identical to those generated using the underlying microdata 
sample with hundreds of thousands of observations. Note, 
however, that the standard errors from the grouped regres- 
sion do not measure the asymptotic sampling variance of the 
slope estimate in repeated microdata samples; for that you
need an estimate of the variance of $y_i - X_i'\beta$. This variance
depends on the microdata, in particular the second moments
of $W_i = [y_i \; X_i']'$, a point we elaborate on in the next section.
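
The equivalence between microdata and grouped data coefficients can be reproduced on simulated data. The sketch below is not the census extract used in Figure 3.1.3; the variable names and the numbers 5.8, 0.07, and 0.6 are hypothetical. It runs OLS on individual observations and WLS on 21 cell means weighted by cell counts, implementing WLS by scaling each row by the square root of its weight.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical microdata: 21 schooling levels and a slightly bumpy CEF.
n = 50_000
school = rng.integers(0, 21, size=n)
log_wage = (5.8 + 0.07 * school + 0.02 * np.sin(school)
            + rng.normal(0.0, 0.6, size=n))

# Microdata OLS of log wages on a constant and schooling.
X = np.column_stack([np.ones(n), school])
beta_micro = np.linalg.lstsq(X, log_wage, rcond=None)[0]

# Grouped data: one conditional mean per schooling level, plus cell counts.
levels, inverse, counts = np.unique(
    school, return_inverse=True, return_counts=True
)
cell_mean = np.bincount(inverse, weights=log_wage) / counts

# WLS on the cell means, weighted by cell size.
w = np.sqrt(counts)
Xg = np.column_stack([np.ones(levels.size), levels])
beta_grouped = np.linalg.lstsq(Xg * w[:, None], cell_mean * w, rcond=None)[0]

print(beta_micro, beta_grouped)
```

Only the coefficients coincide. Standard errors computed from the 21 grouped observations would not reflect the sampling variance in the microdata, as noted above.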


3.1.3 Asymptotic OLS Inference 


In practice, we don’t usually know what the CEF or the 
population regression vector is. We therefore draw statistical 
inferences about these quantities using samples. Statistical 
inference is what much of traditional econometrics is about. 
Although this material is covered in any econometrics text, we 
don’t want to skip the inference step completely. A review of 
basic asymptotic theory allows us to highlight the important 
fact that the process of statistical inference is distinct from the 
question of how a particular set of regression estimates should
be interpreted. Whatever a regression coefficient may mean, it
has a sampling distribution that is easy to describe and use for
statistical inference.4


A - Individual-level data

. regress earnings school, robust

      Source |         SS      df          MS      Number of obs =   409435
-------------+-------------------------------      F(1, 409433)  = 49118.25
       Model |  22631.4793       1  22631.4793      Prob > F      =   0.0000
    Residual |   188648.31  409433  .460755019      R-squared     =   0.1071
-------------+-------------------------------      Adj R-squared =   0.1071
       Total |  211279.789  409434   .51602893      Root MSE      =   .67879

             |                  Robust               Old Fashioned
    earnings |     Coef.    Std. Err.        t     Std. Err.        t
-------------+-------------------------------------------------------
      school |  .0674387    .0003447    195.63     .0003043    221.63
       const |  5.835761    .0045507   1282.39     .0040043   1457.38


B - Means by years of schooling

. regress average_earnings school [aweight=count], robust
(sum of wgt is 4.0944e+05)

      Source |         SS      df          MS      Number of obs =       21
-------------+-------------------------------      F(1, 19)      =   540.31
       Model |  1.16077332       1  1.16077332      Prob > F      =   0.0000
    Residual |  .040818796      19  .002148358      R-squared     =   0.9660
-------------+-------------------------------      Adj R-squared =   0.9642
       Total |  1.20159212      20  .060079606      Root MSE      =   .04635

     average |                  Robust               Old Fashioned
   _earnings |     Coef.    Std. Err.        t     Std. Err.        t
-------------+-------------------------------------------------------
      school |  .0674387    .0040352     16.71     .0029013     23.24
       const |  5.835761    .0399452    146.09     .0381792    152.85


Figure 3.1.3 Microdata and grouped data estimates of the returns
to schooling, from Stata regression output. Source: 1980
Census, IPUMS 5 percent sample. The sample includes white men,
age 40-49. Robust standard errors are heteroskedasticity consistent.
Panel A uses individual-level microdata. Panel B uses earnings
averaged by years of schooling.


4The discussion of asymptotic OLS inference in this section is largely a con- 
densation of material in Chamberlain (1984). Important pitfalls and problems 
with asymptotic theory are covered in the last chapter. 

We are interested in the distribution of the sample analog of

$$\beta = E[X_i X_i']^{-1}E[X_i y_i]$$


in repeated samples. Suppose the vector $W_i = [y_i \; X_i']'$ is
independently and identically distributed in a sample of size $N$.
A natural estimator of the first population moment, $E[W_i]$, is
the sum, $\frac{1}{N}\sum_{i=1}^{N} W_i$. By the law of large numbers, this vector
of sample moments gets arbitrarily close to the corresponding
vector of population moments as the sample size grows. We
might similarly consider higher-order moments of the elements
of $W_i$, for example the matrix of second moments, $E[W_i W_i']$,
with sample analog $\frac{1}{N}\sum_{i=1}^{N} W_i W_i'$. Following this principle,
the method of moments estimator of $\beta$ replaces each
expectation by a sum. This logic leads to the ordinary least squares
(OLS) estimator


$$\hat{\beta} = \left[\sum_i X_i X_i'\right]^{-1} \sum_i X_i y_i.$$


Although we derived $\hat{\beta}$ as a method of moments estimator, it
is called the OLS estimator of $\beta$ because it solves the sample
analog of the least squares problem described at the beginning
of section 3.1.2.5

The asymptotic sampling distribution of $\hat{\beta}$ depends solely on
the definition of the estimand (i.e., the nature of the thing we’re
trying to estimate, $\beta$) and the assumption that the data
constitute a random sample. Before deriving this distribution, it
helps to summarize the general asymptotic distribution theory
that covers our needs. This basic theory can be stated mostly
in words. For the purposes of these statements, we assume the
reader is familiar with the core terms and concepts of
statistical theory (moments, mathematical expectation, probability
limits, and asymptotic distributions). For definitions of these
terms and a formal mathematical statement of the theoretical
propositions given below, see Knight (2000).


5Econometricians like to use matrices because the notation is so compact.
Sometimes (not very often) we do too. Suppose $X$ is the matrix whose rows
are given by $X_i'$ and $y$ is the vector with elements $y_i$, for $i = 1, \ldots, N$. The
sample moment matrix $\frac{1}{N}\sum_i X_i X_i'$ is $X'X/N$ and the sample moment vector
$\frac{1}{N}\sum_i X_i y_i$ is $X'y/N$. Then we can write $\hat{\beta} = (X'X)^{-1}X'y$, a widely used matrix
formula.


THE LAW OF LARGE NUMBERS Sample moments converge in 
probability to the corresponding population moments. In 
other words, the probability that the sample mean is close 
to the population mean can be made as high as you like by 
taking a large enough sample. 


THE CENTRAL LIMIT THEOREM Sample moments are asymp- 
totically normally distributed (after subtracting the corre- 
sponding population moment and multiplying by the square 
root of the sample size). The asymptotic covariance matrix is 
given by the variance of the underlying random variable. In 
other words, in large enough samples, appropriately normal- 
ized sample moments are approximately normally distributed. 


SLUTSKY’S THEOREM 


1. Consider the sum of two random variables, one of which
converges in distribution (in other words, has an
asymptotic distribution) and the other converges in probability to
a constant: the asymptotic distribution of this sum is
unaffected by replacing the one that converges to a constant by
this constant. Formally, let $a_N$ be a statistic with an
asymptotic distribution and let $b_N$ be a statistic with probability
limit $b$. Then $a_N + b_N$ and $a_N + b$ have the same asymptotic
distribution.

2. Consider the product of two random variables, one of which
converges in distribution and the other converges in
probability to a constant: the asymptotic distribution of this
product is unaffected by replacing the one that converges
to a constant by this constant. Formally, let $a_N$ be a
statistic with an asymptotic distribution and let $b_N$ be a statistic
with probability limit $b$. Then $a_N b_N$ and $a_N b$ have the same
asymptotic distribution.


THE CONTINUOUS MAPPING THEOREM Probability limits pass
through continuous functions. For example, the probability
limit of any continuous function of a sample moment is the
function evaluated at the corresponding population moment.
Formally, the probability limit of $h(b_N)$ is $h(b)$, where plim
$b_N = b$ and $h(\cdot)$ is continuous at $b$.


THE DELTA METHOD Consider a vector-valued random
variable that is asymptotically normally distributed. Continuously
differentiable scalar functions of this random variable are also
asymptotically normally distributed, with covariance matrix
given by a quadratic form with the covariance matrix of the
random variable on the inside and the gradient of the function
evaluated at the probability limit of the random variable on
the outside.6 Formally, the asymptotic distribution of $h(b_N)$
is normal with covariance matrix $\nabla h(b)'\Omega\nabla h(b)$, where plim
$b_N = b$, $h(\cdot)$ is continuously differentiable at $b$ with gradient
$\nabla h(b)$, and $b_N$ has asymptotic covariance matrix $\Omega$.7


We can use these results to derive the asymptotic
distribution of $\hat{\beta}$ in two ways. A conceptually straightforward but
somewhat inelegant approach is to use the delta method: $\hat{\beta}$
is a function of sample moments, and is therefore
asymptotically normally distributed. It remains only to find the
covariance matrix of the asymptotic distribution from the
gradient of this function. (Note that consistency of $\hat{\beta}$ comes
immediately from the continuous mapping theorem.)8 An
easier and more instructive derivation uses the Slutsky and central
limit theorems. Note first that we can write


$$y_i = X_i'\beta + [y_i - X_i'\beta] = X_i'\beta + e_i, \tag{3.1.6}$$


where the residual $e_i$ is defined as the difference between the
dependent variable and the population regression function, as
before. In other words, $E[X_i e_i] = 0$ is a consequence of
$\beta = E[X_i X_i']^{-1}E[X_i y_i]$ and $e_i = y_i - X_i'\beta$, and not an assumption
about an underlying economic relation.9


6A quadratic form is a matrix-weighted sum of squares. Suppose $v$ is an
$N \times 1$ vector and $M$ is an $N \times N$ matrix. A quadratic form in $v$ is $v'Mv$. If $M$ is
an $N \times N$ diagonal matrix with diagonal elements $m_i$, then $v'Mv = \sum_i m_i v_i^2$.

7For a derivation of the delta method formula using the Slutsky and
continuous mapping theorems, see Knight (2000, pp. 120-121). We say
“the asymptotic distribution of $h(b_N)$,” but we really mean the asymptotic
distribution of $\sqrt{N}(h(b_N) - h(b))$.

8An estimator is said to be consistent when it converges in probability to
the target parameter.

Substituting the identity (3.1.6) for $y_i$ in the formula for $\hat{\beta}$,
we have


$$\hat{\beta} = \beta + \left[\sum_i X_i X_i'\right]^{-1} \sum_i X_i e_i.$$


The asymptotic distribution of $\hat{\beta}$ is the asymptotic
distribution of
$\sqrt{N}(\hat{\beta} - \beta) = N\left[\sum_i X_i X_i'\right]^{-1} \frac{1}{\sqrt{N}}\sum_i X_i e_i$.
By the Slutsky theorem, this has the same asymptotic distribution as
$E[X_i X_i']^{-1} \frac{1}{\sqrt{N}}\sum_i X_i e_i$. Since $E[X_i e_i] = 0$,
$\frac{1}{\sqrt{N}}\sum_i X_i e_i$ is a root-$N$-normalized and centered sample
moment. By the central
limit theorem, this is asymptotically normally distributed with
mean zero and covariance matrix $E[X_i X_i' e_i^2]$, since this matrix
of fourth moments is the covariance matrix of $X_i e_i$. Therefore,
$\hat{\beta}$ has an asymptotic normal distribution with probability limit
$\beta$ and covariance matrix


$$E[X_i X_i']^{-1} E[X_i X_i' e_i^2] E[X_i X_i']^{-1}. \tag{3.1.7}$$


The theoretical standard errors used to construct t-statistics
are the square roots of the diagonal elements of (3.1.7).
In practice these standard errors are estimated by
substituting sums for expectations and using the estimated residuals,
$\hat{e}_i = y_i - X_i'\hat{\beta}$, to form the empirical fourth moment matrix,
$\sum_i [X_i X_i' \hat{e}_i^2]/N$.

Asymptotic standard errors computed in this way are 
known as heteroskedasticity-consistent standard errors, White 
(1980a) standard errors, or Eicker-White standard errors, in 
recognition of Eicker’s (1967) derivation. They are also known 
as “robust” standard errors (e.g., in Stata). These standard 
errors are said to be robust because, in large enough samples, 
they provide accurate hypothesis tests and confidence
intervals given minimal assumptions about the data and model. In
particular, our derivation of the limiting distribution makes
no assumptions other than those needed to ensure that basic
statistical results like the central limit theorem go through.
Robust standard errors are not, however, the standard errors
that you get by default from packaged software. Default
standard errors are derived under a homoskedasticity
assumption, specifically, that $E[e_i^2|X_i] = \sigma^2$, a constant. Given this
assumption, we have


$$E[X_i X_i' e_i^2] = E(X_i X_i' E[e_i^2|X_i]) = \sigma^2 E[X_i X_i'],$$


by iterating expectations. The asymptotic covariance matrix
of $\hat{\beta}$ then simplifies to


$$
\begin{aligned}
E[X_i X_i']^{-1} E[X_i X_i' e_i^2] E[X_i X_i']^{-1}
&= E[X_i X_i']^{-1} \sigma^2 E[X_i X_i'] E[X_i X_i']^{-1} \\
&= \sigma^2 E[X_i X_i']^{-1}.
\end{aligned}
\tag{3.1.8}
$$


The diagonal elements of (3.1.8) are what SAS or Stata report
unless you request otherwise.


9Residuals defined in this way are not necessarily mean independent of $X_i$;
for mean independence, we need a linear CEF.
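
The two covariance formulas can be compared on simulated data. The sketch below is an illustration under assumed heteroskedasticity (residual variance rising with the square of `x`), not a reproduction of any packaged routine. It computes the sample analog of (3.1.7) without any finite-sample adjustment, and the sample analog of (3.1.8) with the usual degrees-of-freedom correction; both are scaled by $1/N$ so that the square roots of the diagonals are standard errors for $\hat{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical heteroskedastic data: Var(e | x) = x^2.
n = 5_000
x = rng.normal(size=n)
e = x * rng.normal(size=n)
y = 1.0 + 0.5 * x + e
X = np.column_stack([np.ones(n), x])
k = X.shape[1]

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e_hat = y - X @ beta_hat

# Robust (sandwich) covariance: sample analog of (3.1.7), scaled by 1/N.
meat = (X * e_hat[:, None] ** 2).T @ X      # sum of X_i X_i' e_hat_i^2
V_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(V_robust))

# Conventional covariance: sample analog of (3.1.8), scaled by 1/N.
sigma2_hat = e_hat @ e_hat / (n - k)
V_conventional = sigma2_hat * XtX_inv
se_conventional = np.sqrt(np.diag(V_conventional))

print("robust:      ", se_robust)
print("conventional:", se_conventional)
```

In this design the robust standard error for the slope is noticeably larger than the conventional one, because the residual variance is largest where `x` is far from its mean.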

Our view of regression as an approximation to the CEF 
makes heteroskedasticity seem natural. If the CEF is nonlinear 
and you use a linear model to approximate it, then the quality 
of fit between the regression line and the CEF will vary with $X_i$.
Hence, the residuals will be larger, on average, at values of $X_i$
where the fit is poorer. Even if you are prepared to assume
that the conditional variance of $y_i$ given $X_i$ is constant, the
fact that the CEF is nonlinear means that $E[(y_i - X_i'\beta)^2|X_i]$
will vary with $X_i$. To see this, note that


$$
\begin{aligned}
E[(y_i - X_i'\beta)^2|X_i]
&= E\{[(y_i - E[y_i|X_i]) + (E[y_i|X_i] - X_i'\beta)]^2 \,|\, X_i\} \\
&= V[y_i|X_i] + (E[y_i|X_i] - X_i'\beta)^2.
\end{aligned}
\tag{3.1.9}
$$


Therefore, even if $V[y_i|X_i]$ is constant, the residual variance
increases with the square of the gap between the regression
line and the CEF, a fact noted in White (1980b).10


10The cross-product term resulting from an expansion of the squared term
in the middle of (3.1.9) is zero because $y_i - E[y_i|X_i]$ is mean independent
of $X_i$.
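
Equation (3.1.9) has an exact sample counterpart when the covariate is discrete. The sketch below assumes simulated data with a quadratic CEF and a constant conditional variance; it compares the average squared regression residual within each cell of `x` with the within-cell variance plus the squared gap between the cell mean and the fitted line.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical data: nonlinear CEF, constant conditional variance of 1.
n = 40_000
x = rng.integers(0, 5, size=n)
y = (x - 2.0) ** 2 + rng.normal(size=n)

# Linear regression of y on a constant and x.
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid_sq = (y - X @ beta_hat) ** 2

# Cell-level pieces of (3.1.9).
cells, inverse = np.unique(x, return_inverse=True)
counts = np.bincount(inverse)
cef = np.bincount(inverse, weights=y) / counts
cond_var = np.bincount(inverse, weights=(y - cef[inverse]) ** 2) / counts
fit = beta_hat[0] + beta_hat[1] * cells

lhs = np.bincount(inverse, weights=resid_sq) / counts   # E[(y - X'b)^2 | X = u]
rhs = cond_var + (cef - fit) ** 2                        # V[y|X = u] + gap^2

print(np.round(lhs, 3))
print(np.round(rhs, 3))
```

The residual variance differs across cells even though the conditional variance of `y` is the same in every cell, which is the point made in the text.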

In the same spirit, it’s also worth noting that while a linear
CEF makes homoskedasticity possible, this is not a sufficient
condition for homoskedasticity. Our favorite example in this
context is the linear probability model (LPM). A linear
probability model is any regression where the dependent variable
is zero-one, that is, a dummy variable such as an indicator
for labor force participation. Suppose the regression model is
saturated, so the CEF given regressors is linear. Because the
CEF is linear, the residual variance is also the conditional
variance, $V[y_i|X_i]$. But the dependent variable is a Bernoulli trial
with conditional variance $P[y_i = 1|X_i](1 - P[y_i = 1|X_i])$. We
conclude that LPM residuals are necessarily heteroskedastic
unless the only regressor is a constant.

These points of principle notwithstanding, as an empirical 
matter, heteroskedasticity may matter little. In the microdata 
schooling regression depicted in figure 3.1.3, the robust stan- 
dard error is .0003447, while the old-fashioned standard error 
is .0003043, not much smaller. The standard errors from the 
grouped data regression, which are necessarily heteroskedas- 
tic if group sizes differ, change somewhat more; compare the 
.004 robust standard error to the .0029 conventional standard error.
Based on our experience, these differences are typical. If het- 
eroskedasticity matters a lot, say, more than a 30 percent 
increase or any marked decrease in standard errors, you should 
worry about possible programming errors or other problems. 
For example, robust standard errors below conventional may 
be a sign of finite-sample bias in the robust calculation. 
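
As a rough illustration of the last two paragraphs, the following sketch simulates a saturated linear probability model and compares conventional and robust standard errors. The probabilities, sample size, and names are assumptions made for the example, and the robust formula used here is the basic White sandwich estimator without finite-sample corrections.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.integers(0, 2, size=n)              # a single dummy regressor
p = np.where(x == 1, 0.9, 0.5)              # assumed P[Y = 1 | X]
y = (rng.random(n) < p).astype(float)       # Bernoulli outcome

X = np.column_stack([np.ones(n), x])        # saturated: constant plus dummy
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Conventional (homoskedastic) variance: sigma^2 (X'X)^{-1}.
sigma2 = resid @ resid / (n - X.shape[1])
se_conv = np.sqrt(np.diag(sigma2 * XtX_inv))

# Robust (White) variance: (X'X)^{-1} [sum_i x_i x_i' e_i^2] (X'X)^{-1}.
meat = (X * resid[:, None] ** 2).T @ X
se_rob = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# In a saturated LPM the residual variance in each cell is p(1 - p).
phat = {v: y[x == v].mean() for v in (0, 1)}
var_by_cell = {v: resid[x == v].var() for v in (0, 1)}

print("conventional SEs:", se_conv)
print("robust SEs:      ", se_rob)
print("cell residual variances:", var_by_cell)
```

The cell variances differ because the fitted probabilities differ, which is the sense in which LPM residuals are necessarily heteroskedastic. How far the two sets of standard errors diverge depends on the assumed design.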

Finally, a brief note on the textbook approach to inference 
that you might have seen elsewhere. Traditional economet- 
ric inference begins with stronger assumptions than those we 
have invoked in this section. The traditional set-up, sometimes 
called a classical normal regression model, postulates: fixed 
(non-stochastic) regressors, a linear CEF, normally distributed 
errors, and homoskedasticity (see, e.g., Goldberger, 1991). 
These stronger assumptions give us two things: (1) unbiased- 
ness of the OLS estimator, (2) a formula for the sampling 
variance of the OLS estimator that is valid in small as well as 
large samples. Unbiasedness of the OLS estimators means that 
$E[\hat{\beta}] = \beta$, a property that holds in a sample of any size and is
stronger than consistency, which means only that we can
expect $\hat{\beta}$ to be close to $\beta$ in large samples. It’s easy to see when
and why we get unbiasedness. In general, 


$$E[\hat{\beta}] = \beta + E\left\{\left[\sum_i X_i X_i'\right]^{-1} \sum_i X_i e_i\right\}.$$


If the regressors are nonrandom (fixed in repeated samples) the 
expectation passes through and we have unbiasedness because 
$E[e_i] = 0$. Otherwise, with random regressors, we can iterate
expectations and get unbiasedness if $E[e_i|X_i] = 0$. This is true
when the CEF is linear, but not in our more general “agnostic 
regression” framework. 

The variance formula obtained under classical assumptions 
is the same as the large-sample formula under homoskedastic- 
ity but—provided the strong classical assumptions are valid— 
this formula holds in a sample of any size. We’ve chosen to 
start with the asymptotic approach to inference because mod- 
ern empirical work typically leans heavily on the large-sample 
theory that lies behind robust variance formulas. The pay- 
off is valid inference under weak assumptions, in particular, a 
framework that makes sense for our less-than-literal approach 
to regression models. On the other hand, the large-sample 
approach is not without its dangers, a point we return to in 
the discussion of inference in chapter 8 and in the discussion 
of instrumental variables in chapter 4. 


3.1.4 Saturated Models, Main Effects, 
and Other Regression Talk 


We often discuss regression models using terms like saturated 
and main effects. These terms originate in an experimentalist 
tradition that uses regression to model the effects of discrete 
treatment-type variables. This language is now used more 
widely in many fields, however, including applied economet- 
rics. For readers unfamiliar with these terms, this section 
provides a brief review. 

Saturated regression models are regression models with 
discrete explanatory variables, where the model includes a 
separate parameter for all possible values taken on by the 
explanatory variables. For example, when working with a
single explanatory variable indicating whether a worker is a 
college graduate, the model is saturated by including a single 
dummy for college graduates and a constant. We can also
saturate when the regressor takes on many values. Suppose, for
example, that $s_i = 0, 1, 2, \ldots, \tau$. A saturated regression model
for $s_i$ is

$$Y_i = \alpha + \beta_1 d_{1i} + \beta_2 d_{2i} + \cdots + \beta_\tau d_{\tau i} + \varepsilon_i,$$

where $d_{ji} = 1[s_i = j]$ is a dummy variable indicating schooling
level $j$, and $\beta_j$ is said to be the $j$th-level schooling effect.¹¹ Note
that

$$\beta_j = E[Y_i|s_i = j] - E[Y_i|s_i = 0],$$

while $\alpha = E[Y_i|s_i = 0]$. In practice, you can pick any value of
$s_i$ for the reference group; a regression model is saturated as
long as it has one parameter for every possible $j$ in $E[Y_i|s_i = j]$.
Saturated regression models fit the CEF perfectly because
the CEF is a linear function of the dummy regressors used to 
saturate. This is an important special case of the linear CEF 
theorem. 
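
The claim that a saturated model reproduces the CEF exactly can be verified on simulated data. This is a minimal sketch with hypothetical values (four schooling levels and an arbitrary nonlinear outcome equation): the estimated constant should equal the sample mean in the reference group, and each dummy coefficient should equal a difference in cell means.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 5_000, 3
s = rng.integers(0, tau + 1, size=n)                  # schooling levels 0..tau
y = np.log1p(s) + 0.1 * s**2 + rng.normal(size=n)     # arbitrary nonlinear CEF

# Saturated design: a constant plus one dummy d_j = 1[s = j] for j = 1..tau.
D = np.column_stack(
    [np.ones(n)] + [(s == j).astype(float) for j in range(1, tau + 1)]
)
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]

# Sample analog of the CEF: the mean of y in each schooling cell.
cell_means = np.array([y[s == j].mean() for j in range(tau + 1)])

print("alpha_hat:", alpha_hat, " mean at s=0:", cell_means[0])
print("beta_hat: ", beta_hat)
print("mean differences:", cell_means[1:] - cell_means[0])
```

Changing the reference group changes the individual coefficients but not the fitted values, which remain the cell means.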

If there are two explanatory variables—say, one dummy 
indicating college graduates and one dummy indicating sex— 
the model is saturated by including these two dummies, their 
product, and a constant. The coefficients on the dummies are 
known as main effects, while the product is called an interac- 
tion term. This is not the only saturated parameterization; any 
set of indicators (dummies) that can be used to identify each 
value taken on by all covariates produces a saturated model. 
For example, an alternative saturated model includes dummies 
for male college graduates, male nongraduates, female college 
graduates, and female nongraduates, but no intercept. 

¹¹ We use the notation $1[s_i = j]$ to denote the indicator function, in this case
a function that creates a dummy variable switched on when $s_i = j$.

Here’s some notation to make this more concrete. Let $x_{1i}$
indicate college graduates and $x_{2i}$ indicate women. The CEF
given $x_{1i}$ and $x_{2i}$ takes on four values:

$$
\begin{aligned}
&E[Y_i|x_{1i} = 0, x_{2i} = 0], \\
&E[Y_i|x_{1i} = 1, x_{2i} = 0], \\
&E[Y_i|x_{1i} = 0, x_{2i} = 1], \\
&E[Y_i|x_{1i} = 1, x_{2i} = 1].
\end{aligned}
$$


We can label these using the following scheme: 


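$$
\begin{aligned}
E[Y_i|x_{1i} = 0, x_{2i} = 0] &= \alpha \\
E[Y_i|x_{1i} = 1, x_{2i} = 0] &= \alpha + \beta_1 \\
E[Y_i|x_{1i} = 0, x_{2i} = 1] &= \alpha + \gamma \\
E[Y_i|x_{1i} = 1, x_{2i} = 1] &= \alpha + \beta_1 + \gamma + \delta_1.
\end{aligned}
$$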


Since there are four Greek letters and the CEF takes on four 
values, this parameterization does not restrict the CEF. It can 
be written in terms of Greek letters as 


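$$E[Y_i|x_{1i}, x_{2i}] = \alpha + \beta_1 x_{1i} + \gamma x_{2i} + \delta_1 (x_{1i} x_{2i}),$$

a parameterization with two main effects and one interaction
term.¹² The saturated regression equation becomes

$$Y_i = \alpha + \beta_1 x_{1i} + \gamma x_{2i} + \delta_1 (x_{1i} x_{2i}) + \varepsilon_i.$$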


We can combine the multivalued schooling variable with 
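sex to produce a saturated model that has $\tau$ main effects
for schooling, one main effect for sex, and $\tau$ sex-schooling
interactions:

$$Y_i = \alpha + \sum_{j=1}^{\tau} \beta_j d_{ji} + \gamma x_{2i} + \sum_{j=1}^{\tau} \delta_j (d_{ji} x_{2i}) + \varepsilon_i. \qquad (3.1.10)$$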


The coefficients on the interaction terms, $\delta_j$, tell us how each
of the schooling effects differs by sex. The CEF in this case
takes on $2(\tau + 1)$ values, while the regression has this many
parameters.

¹² With a third dummy variable in the model, say $x_{3i}$, a saturated model
includes three main effects, three second-order interaction terms
$\{x_{1i}x_{2i}, x_{1i}x_{3i}, x_{2i}x_{3i}\}$, and one third-order term, $x_{1i}x_{2i}x_{3i}$.

Note that there is a hierarchy of increasingly restrictive mod- 
eling strategies with saturated models at the top. It’s natural 
to start with a saturated model because this fits the CEF. On 
the other hand, saturated models generate a lot of interac- 
tion terms, many of which may be uninteresting or estimated 
imprecisely. You might therefore sensibly choose to omit some 
or all of these terms. Equation (3.1.10) without interaction 
terms approximates the CEF using a purely additive model for 
schooling and sex. This is a good approximation if the returns 
to college are similar for men and women. In any case, school- 
ing coefficients in the additive specification give a (weighted) 
average return across both sexes, as discussed in section 3.3.1. 
On the other hand, it would be strange to estimate a model 
that included interaction terms but omitted the corresponding 
main effects. In the case of schooling, this is something like 


$$Y_i = \alpha + \gamma x_{2i} + \sum_{j=1}^{\tau} \delta_j (d_{ji} x_{2i}) + \varepsilon_i. \qquad (3.1.11)$$


This model allows schooling to shift wages only for women, 
something very far from the truth. Consequently, the results 
of estimating (3.1.11) are likely to be hard to interpret. 

Finally, it’s important to recognize that a saturated model 
fits the CEF perfectly regardless of the distribution of $Y_i$. For
example, this is true for linear probability models and other
limited dependent variable models (e.g., non-negative $Y_i$), a
point we return to at the end of this chapter. 


3.2 Regression and Causality 


Section 3.1.2 shows how regression gives the best (MMSE) lin- 
ear approximation to the CEF. This understanding, however, 
does not help us with the deeper question of when regression 
has a causal interpretation. When can we think of a regression 
coefficient as approximating the causal effect that might be 
revealed in an experiment? 


3.2.1 The Conditional Independence Assumption


A regression is causal when the CEF it approximates is causal. 
This doesn’t answer the question, of course. It just passes the 
buck up one level, since, as we’ve seen, a regression inherits 
its legitimacy from a CEF. Causality means different things 
to different people, but researchers working in many disci- 
plines have found it useful to think of causal relationships in 
terms of the potential outcomes notation used in chapter 2 to 
describe what would happen to a given individual in a hypo- 
thetical comparison of alternative hospitalization scenarios. 
Differences in these potential outcomes were said to be the 
causal effect of hospitalization. The CEF is causal when it 
describes differences in average potential outcomes for a fixed 
reference population. 

It’s easiest to expand on the somewhat murky notion of a 
causal CEF in the context of a particular question, so let’s stick 
with the schooling example. The causal connection between 
schooling and earnings can be defined as the functional rela- 
tionship that describes what a given individual would earn if 
he or she obtained different levels of education. In particu- 
lar, we might think of schooling decisions as being made in 
a series of episodes where the decision maker can realistically 
go one way or another, even if certain choices are more likely 
than others. For example, in the middle of junior year, restless 
and unhappy, Angrist glumly considered his options: drop- 
ping out of high school and hopefully getting a job, staying 
in school but taking easy classes that would lead to a quick 
and dirty high school diploma, or plowing on in an academic 
track that would lead to college. Although the consequences 
of such choices are usually unknown in advance, the idea of 
alternative paths leading to alternative outcomes for a given 
individual seems uncontroversial. Philosophers have argued 
over whether this personal notion of potential outcomes is pre- 
cise enough to be scientifically useful, but individual decision 
makers seem to have no trouble thinking about their lives and 
choices in this manner (as in Robert Frost’s celebrated “The 
Road Not Taken”: the traveler-narrator sees himself looking 
back on a moment of choice. He believes that the decision to 
follow the road less traveled “has made all the difference,”
though he also recognizes that counterfactual outcomes are 
unknowable). 

In empirical work, the causal relationship between school- 
ing and earnings tells us what people would earn, on average, if 
we could either change their schooling in a perfectly controlled 
environment or change their schooling randomly so that those 
with different levels of schooling would be otherwise compara- 
ble. As we discussed in chapter 2, experiments ensure that the 
causal variable of interest is independent of potential outcomes 
so that the groups being compared are truly comparable. Here, 
we would like to generalize this notion to causal variables 
that take on more than two values, and to more complicated 
situations where we must hold a variety of control variables 
fixed for causal inferences to be valid. This leads to the condi- 
tional independence assumption (CIA), a core assumption that 
provides the (sometimes implicit) justification for the causal 
interpretation of regression estimates. This assumption is also 
called selection on observables because the covariates to be 
held fixed are assumed to be known and observed (e.g., in 
Goldberger, 1972; Barnow, Cain, and Goldberger, 1981). The 
big question, therefore, is what these control variables are, or 
should be. We’ll say more about that shortly. For now, we 
just do the econometric thing and call the covariates $X_i$. As
far as the schooling problem goes, it seems natural to imagine
that $X_i$ is a vector that includes measures of ability and family
background. 

For starters, think of schooling as a binary decision, such as 
whether Angrist goes to college. Denote this by a dummy
variable, $c_i$. The causal relationship between college attendance
and a future outcome such as earnings can be described using
the same potential outcomes notation we used to describe
experiments in chapter 2. To address this question, we imagine
two potential earnings variables:

$$\text{Potential outcome} = \begin{cases} Y_{1i} & \text{if } c_i = 1 \\ Y_{0i} & \text{if } c_i = 0 \end{cases}$$


In this case, $Y_{0i}$ is $i$’s earnings without college, while $Y_{1i}$ is
$i$’s earnings if he goes. We would like to know the difference
between $Y_{1i}$ and $Y_{0i}$, which is the causal effect of college
attendance on individual $i$. This is what we would measure if we
could go back in time and nudge $i$ onto the road not taken.
The observed outcome, $Y_i$, can be written in terms of potential
outcomes as

$$Y_i = Y_{0i} + (Y_{1i} - Y_{0i}) c_i.$$

We get to see one of $Y_{1i}$ or $Y_{0i}$, but never both. We therefore
hope to measure the average of $Y_{1i} - Y_{0i}$, or the average for
some group, such as those who went to college. This is
$E[Y_{1i} - Y_{0i}|c_i = 1]$.

In general, comparisons of those who do and don’t go to 
college are likely to be a poor measure of the causal effect of 
college attendance. Following the logic in chapter 2, we have 


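$$
\begin{aligned}
\underbrace{E[Y_i|c_i = 1] - E[Y_i|c_i = 0]}_{\text{Observed difference in earnings}}
&= \underbrace{E[Y_{1i} - Y_{0i}|c_i = 1]}_{\text{Average treatment effect on the treated}} \\
&\quad + \underbrace{E[Y_{0i}|c_i = 1] - E[Y_{0i}|c_i = 0]}_{\text{Selection bias}}. \qquad (3.2.1)
\end{aligned}
$$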


It seems likely that those who go to college would have 
earned more anyway. If so, selection bias is positive and the 
naive comparison, $E[Y_i|c_i = 1] - E[Y_i|c_i = 0]$, exaggerates the
benefits of college attendance. 
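
The decomposition in (3.2.1) holds as an identity in any sample where both potential outcomes are known, which is only possible in a simulation. The sketch below assumes a hypothetical data-generating process in which an unobserved "ability" variable raises both earnings and the chance of attending college; all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
ability = rng.normal(size=n)                        # unobserved in real data
y0 = 10 + ability + rng.normal(size=n)              # earnings without college
y1 = y0 + 1.0 + 0.5 * ability                       # earnings with college
c = (ability + rng.normal(size=n) > 0).astype(int)  # high-ability people attend more
y = y0 + (y1 - y0) * c                              # observed outcome

observed_diff = y[c == 1].mean() - y[c == 0].mean()
att = (y1 - y0)[c == 1].mean()                      # effect on the treated
selection_bias = y0[c == 1].mean() - y0[c == 0].mean()

print(f"observed difference: {observed_diff:.3f}")
print(f"ATT:                 {att:.3f}")
print(f"selection bias:      {selection_bias:.3f}")
```

Because college attendance is positively related to ability in this setup, the selection bias term is positive and the naive comparison overstates the effect on the treated.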

The CIA asserts that conditional on observed characteristics, 
$X_i$, selection bias disappears. Formally, this means

$$\{Y_{0i}, Y_{1i}\} \perp\!\!\!\perp c_i \mid X_i, \qquad (3.2.2)$$

where the symbol “$\perp\!\!\!\perp$” denotes the independence relation
and random variables to the right of the vertical bar are the
conditioning set. Given the CIA, conditional-on-$X_i$ comparisons
of average earnings across schooling levels have a causal
interpretation. In other words,

$$E[Y_i|X_i, c_i = 1] - E[Y_i|X_i, c_i = 0] = E[Y_{1i} - Y_{0i}|X_i].$$


Now, we’d like to expand the conditional independence
assumption to causal relations that involve variables that can
take on more than two values, such as years of schooling, $s_i$.
The causal relationship between schooling and earnings is
likely to be different for each person. We therefore use the 
individual-specific functional notation, 


$$Y_{si} \equiv f_i(s)$$


to denote the potential earnings that person i would receive 
after obtaining s years of education. If s takes on only two 
values, 12 and 16, then we are back to the college/no college 
example: 


$$Y_{0i} = f_i(12); \quad Y_{1i} = f_i(16).$$


More generally, the function $f_i(s)$ tells us what $i$ would earn for
any value of schooling, $s$. In other words, $f_i(s)$ answers causal
“what if” questions. In the context of theoretical models of the
relationship between human capital and earnings, the form of
$f_i(s)$ may be determined by aspects of individual behavior, by
market forces, or both. 

The CIA in this more general setup becomes 


$$Y_{si} \perp\!\!\!\perp s_i \mid X_i, \text{ for all } s. \qquad \text{(CIA)}$$


In many randomized experiments, the CIA crops up because
$s_i$ is randomly assigned conditional on $X_i$ (in the Tennessee
STAR experiment, for example, small classes were randomly
assigned within schools). In an observational study, the CIA
means that $s_i$ can be said to be “as good as randomly
assigned,” conditional on $X_i$.

Conditional on $X_i$, the average causal effect of a one-year
increase in schooling is $E[f_i(s) - f_i(s - 1)|X_i]$, while the
average causal effect of a four-year increase in schooling is
$E[f_i(s) - f_i(s - 4)|X_i]$. The data reveal only $Y_i = f_i(s_i)$, that
is, $f_i(s)$ for $s = s_i$. Given the CIA, conditional-on-$X_i$
comparisons of average earnings across schooling levels have a
causal interpretation. In other words,

$$E[Y_i|X_i, s_i = s] - E[Y_i|X_i, s_i = s - 1] = E[f_i(s) - f_i(s - 1)|X_i]$$


for any value of s. For example, we can compare the earnings 
of those with 12 and 11 years of schooling to learn about the 
average causal effect of high school graduation:

$$
\begin{aligned}
&E[Y_i|X_i, s_i = 12] - E[Y_i|X_i, s_i = 11] \\
&\quad = E[f_i(12)|X_i, s_i = 12] - E[f_i(11)|X_i, s_i = 11].
\end{aligned}
$$


This comparison has a causal interpretation because, given the 
CIA, 


$$
\begin{aligned}
&E[f_i(12)|X_i, s_i = 12] - E[f_i(11)|X_i, s_i = 11] \\
&\quad = E[f_i(12) - f_i(11)|X_i, s_i = 12].
\end{aligned}
$$


Here, selection bias comes from differences in the potential 
dropout earnings of high school graduates and nongraduates. 
Given the CIA, however, high school graduation is independent
of potential earnings conditional on $X_i$, so the selection
bias vanishes. Note also that in this case, the causal effect of
graduating from high school on high school graduates is equal
to the average high school graduation effect at $X_i$:

$$E[f_i(12) - f_i(11)|X_i, s_i = 12] = E[f_i(12) - f_i(11)|X_i].$$


This is important, but less important than the elimination of 
selection bias. 

So far, we have constructed separate causal effects for each 
value taken on by the conditioning variables. This leads to as 
many causal effects as there are values of $X_i$, an embarrassment
of riches. Empiricists almost always find it useful to boil
a set of estimates down to a single summary measure, such 
as the unconditional or overall average causal effect. By the 
law of iterated expectations, the unconditional average causal 
effect of high school graduation is 


$$
\begin{aligned}
&E\{E[Y_i|X_i, s_i = 12] - E[Y_i|X_i, s_i = 11]\} \qquad (3.2.3) \\
&\quad = E\{E[f_i(12) - f_i(11)|X_i]\} \\
&\quad = E[f_i(12) - f_i(11)]. \qquad (3.2.4)
\end{aligned}
$$


In the same spirit, we might be interested in the average causal 
effect of high school graduation on high school graduates: 

$$
\begin{aligned}
&E\{E[Y_i|X_i, s_i = 12] - E[Y_i|X_i, s_i = 11] \mid s_i = 12\} \qquad (3.2.5) \\
&\quad = E\{E[f_i(12) - f_i(11)|X_i] \mid s_i = 12\} \\
&\quad = E[f_i(12) - f_i(11)|s_i = 12]. \qquad (3.2.6)
\end{aligned}
$$

This parameter tells us how much high school graduates
gained by virtue of having graduated. Likewise, for the effects
of college graduation there is a distinction between
$E[f_i(16) - f_i(12)|s_i = 16]$, the average causal effect on college graduates,
and $E[f_i(16) - f_i(12)]$, the unconditional average effect.

The unconditional average effect, (3.2.3), can be computed 
by averaging all the X-specific effects weighted by the marginal 
distribution of $X_i$, while the average effect on high school or
college graduates averages the X-specific effects weighted by
the distribution of $X_i$ in these groups. In both cases, the empirical
counterpart is a matching estimator: we make comparisons
across schooling groups for individuals with the same covari- 
ate values, compute the difference in their average earnings, 
and then average these differences in some way. 
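
The two weighting schemes just described can be sketched with a discrete covariate. Everything in the example below is an assumption made for illustration: a three-valued covariate `x`, a binary treatment `d` whose probability depends only on `x` (so the CIA holds by construction), and X-specific effects of 1, 2, and 3. Under this design the population average effect is 2.0 and the average effect on the treated is 2.4.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.integers(0, 3, size=n)                    # discrete covariate
p_treat = np.array([0.2, 0.5, 0.8])[x]            # treatment depends only on x
d = (rng.random(n) < p_treat).astype(int)
effect = np.array([1.0, 2.0, 3.0])[x]             # X-specific causal effect
y0 = 2.0 * x + rng.normal(size=n)                 # untreated outcome
y = y0 + effect * d                               # observed outcome

cells = np.arange(3)
# Step 1 (matching): treated-minus-control difference within each cell of x.
diff = np.array(
    [y[(x == k) & (d == 1)].mean() - y[(x == k) & (d == 0)].mean() for k in cells]
)
# Step 2 (averaging): weight by the distribution of x overall or among the treated.
w_all = np.array([(x == k).mean() for k in cells])
w_treated = np.array([(x[d == 1] == k).mean() for k in cells])

ate_hat = diff @ w_all
att_hat = diff @ w_treated
naive = y[d == 1].mean() - y[d == 0].mean()

print(f"matching ATE estimate: {ate_hat:.3f}")
print(f"matching ATT estimate: {att_hat:.3f}")
print(f"naive comparison:      {naive:.3f}")
```

The naive comparison mixes the causal effect with differences in `x` between the treated and untreated, while the two matching estimates differ from each other only through the weights.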

In practice, there are many details to worry about when 
implementing a matching strategy. We fill in some of the tech- 
nical details on the mechanics of matching in section 3.3.1. 
Here we note that a drawback of the matching approach is 
that it is not automatic; rather, it requires two steps, matching 
and averaging. Estimating the standard errors of the resulting 
estimates may not be straightforward, either. A third con- 
sideration is that the two-way contrast at the heart of this 
subsection (high school or college completers versus dropouts) 
does not do full justice to the problem at hand. Since $s_i$ takes on
many values, there are separate average causal effects for each
possible increment in $s_i$, which also must be summarized in
some way.¹³ These considerations lead us back to regression.

¹³ For example, we might construct the average effect over $s$ using the
distribution of $s_i$. In other words, we estimate $E[f_i(s) - f_i(s - 1)]$ for each $s$ by
matching, and then compute the average difference

$$\sum_s E[f_i(s) - f_i(s - 1)] P(s),$$

where $P(s)$ is the probability mass function for $s_i$. This is a discrete
approximation to the average derivative, $E[f_i'(s_i)]$.

Regression provides an easy-to-use empirical strategy that
automatically turns the CIA into causal effects. Two routes
can be traced from the CIA to regression. One assumes that
$f_i(s)$ is both linear in $s$ and the same for everyone except for
an additive error term, in which case linear regression is a
natural tool to estimate the features of $f_i(s)$. A more general but
somewhat longer route recognizes that $f_i(s)$ almost certainly
differs for different people, and moreover need not be linear
in $s$. Even so, allowing for random variation in $f_i(s)$ across
people and for nonlinearity for a given person, regression can
be thought of as a strategy for the estimation of a weighted
average of the individual-specific difference, $f_i(s) - f_i(s - 1)$.
In fact, regression can be seen as a particular sort of matching
estimator, capturing an average causal effect, much as (3.2.3)
or (3.2.5) does.

At this point, we want to focus on the conditions required 
for regression to have a causal interpretation and not on the 
details of the regression-matching analog. We therefore start 
with the first route, a linear constant effects causal model. 
Suppose that 


$$f_i(s) = \alpha + \rho s + \eta_i. \qquad (3.2.7)$$


In addition to being linear, this equation says that the func- 
tional relationship of interest is the same for everyone. Again, 
s is written without an i subscript, because equation (3.2.7) 
tells us what person $i$ would earn for any value of $s$, and not
just the realized value, $s_i$. In this case, however, the only
individual-specific and random part of $f_i(s)$ is a mean-zero
error component, $\eta_i$, which captures unobserved factors that
determine potential earnings. 

Substituting the observed value $s_i$ for $s$ in equation (3.2.7),
we have

$$Y_i = \alpha + \rho s_i + \eta_i. \qquad (3.2.8)$$


Equation (3.2.8) looks like a bivariate regression model, 
except that equation (3.2.7) explicitly associates the coef- 
ficients in (3.2.8) with a causal relationship. Importantly, 
because equation (3.2.7) is a causal model, $s_i$ may be correlated
with potential outcomes, $f_i(s)$, or, in this case, the residual term
in (3.2.8), $\eta_i$.

Suppose now that the CIA holds given a vector of observed
covariates, $X_i$. In addition to the functional form assumption
for potential outcomes embodied in (3.2.8), we decompose the
random part of potential earnings, $\eta_i$, into a linear function of
observable characteristics, $X_i$, and an error term, $v_i$:

$$\eta_i = X_i'\gamma + v_i,$$

where $\gamma$ is a vector of population regression coefficients that
is assumed to satisfy $E[\eta_i|X_i] = X_i'\gamma$. Since $\gamma$ is defined by the
regression of $\eta_i$ on $X_i$, the residual $v_i$ and $X_i$ are uncorrelated
by construction. Moreover, by virtue of the CIA, we have

$$
\begin{aligned}
E[f_i(s)|X_i, s_i] = E[f_i(s)|X_i] &= \alpha + \rho s + E[\eta_i|X_i] \\
&= \alpha + \rho s + X_i'\gamma.
\end{aligned}
$$

The residual in the linear causal model

$$Y_i = \alpha + \rho s_i + X_i'\gamma + v_i \qquad (3.2.9)$$

is therefore uncorrelated with the regressors, $s_i$ and $X_i$, and
the regression coefficient $\rho$ is the causal effect of interest.

It bears emphasizing once again that the key assumption 
here is that the observable characteristics, $X_i$, are the only
reason why $\eta_i$ and $s_i$ (equivalently, $f_i(s)$ and $s_i$) are correlated.
This is the selection-on-observables assumption for regression 
models discussed over a quarter century ago by Barnow, Cain, 
and Goldberger (1981). It remains the basis of most empirical 
work in economics. 
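
A small simulation can illustrate the argument of this subsection. The sketch below builds data from the linear constant-effects model with hypothetical parameter values (a causal effect of 0.10 and two covariates), lets schooling depend on the covariates but not on the remaining error, and compares the regression with controls to the one without. Every number and name here is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
alpha, rho = 1.0, 0.10                     # hypothetical causal parameters
gamma = np.array([0.3, 0.2])

X = rng.normal(size=(n, 2))                # observed covariates
v = rng.normal(scale=0.5, size=n)          # part of eta unrelated to X and s
eta = X @ gamma + v                        # eta_i = X_i'gamma + v_i
# Schooling depends on X and on noise independent of v, so the CIA holds.
s = 12 + X @ np.array([1.0, 1.0]) + rng.normal(size=n)
y = alpha + rho * s + eta                  # observed earnings

def ols(regressors, outcome):
    """Least-squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(regressors, outcome, rcond=None)
    return coef

ones = np.ones(n)
rho_long = ols(np.column_stack([ones, s, X]), y)[1]    # controls for X
rho_short = ols(np.column_stack([ones, s]), y)[1]      # omits X

print(f"true rho:              {rho:.3f}")
print(f"with controls for X:   {rho_long:.3f}")
print(f"without controls:      {rho_short:.3f}")
```

With the controls included, the schooling coefficient is close to the assumed causal effect; without them it also picks up the association between the covariates and schooling.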


3.2.2 The Omitted Variables Bias Formula 


In addition to the variable of interest, $s_i$, we have now
introduced a set of control variables, $X_i$, into our regression. The
omitted variables bias (OVB) formula describes the relation- 
ship between regression estimates in models with different sets 
of control variables. This important formula is often motivated 
by the notion that a longer regression—one with controls, 
such as (3.2.9)—has a causal interpretation, while a shorter 
regression does not. The coefficients on the variables included 
in the shorter regression are therefore said to be biased. In 
fact, the OVB formula is a mechanical link between coefficient 
vectors that applies to short and long regressions whether or 
not the longer regression is causal. Nevertheless, we follow 
convention and refer to the difference between the included 
coefficients in a long regression and a short regression as being
determined by the OVB formula. 

To make this discussion concrete, suppose the relevant set 
of control variables in the schooling regression can be boiled 
down to a combination of family background, intelligence, 
and motivation. Let these specific factors be denoted by a
vector, $A_i$, which we refer to by the shorthand term “ability.” The
regression of wages on schooling, $s_i$, controlling for ability can
be written as

$$Y_i = \alpha + \rho s_i + A_i'\gamma + e_i, \qquad (3.2.10)$$


where $\alpha$, $\rho$, and $\gamma$ are population regression coefficients and $e_i$
is a regression residual that is uncorrelated with all regressors
by definition. If the CIA applies given $A_i$, then $\rho$ can be equated
with the coefficient in the linear causal model, (3.2.7), while
the residual $e_i$ is the random part of potential earnings that is
left over after controlling for $A_i$.

In practice, ability is hard to measure. For example, the 
American Current Population Survey (CPS), a large data set 
widely used in applied microeconomics (and the source of 
U.S. government data on unemployment rates), tells us noth- 
ing about adult respondents’ family background, intelligence, 
or motivation. What are the consequences of leaving ability 
out of regression (3.2.10)? The resulting “short regression” 
coefficient is related to the “long regression” coefficient in 
equation (3.2.10) as follows: 


OMITTED VARIABLES BIAS FORMULA 


$$\frac{Cov(Y_i, s_i)}{V(s_i)} = \rho + \gamma'\delta_{As}, \qquad (3.2.11)$$

where $\delta_{As}$ is the vector of coefficients from regressions of the
elements of $A_i$ on $s_i$. To paraphrase, the OVB formula says:


Short equals long plus the effect of omitted times the regression 
of omitted on included. 


This formula is easy to derive: plug the long regression
into the short regression formula, $Cov(Y_i, s_i)/V(s_i)$. Not surprisingly,
the OVB formula is closely related to the regression anatomy
formula, (3.1.3), from section 3.1.2. Both the OVB formula
and the regression anatomy formula tell us that short and long
regression coefficients are the same whenever the omitted and
included variables are uncorrelated.¹⁴
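
Because the OVB formula is a mechanical link between two sets of least-squares coefficients, it holds exactly in any sample, not just in the population. The sketch below checks this on simulated data. The two-element "ability" vector, the coefficient values, and the helper `ols` are hypothetical choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
A = rng.normal(size=(n, 2))                              # "ability" vector
s = 12 + A @ np.array([1.0, 0.5]) + rng.normal(size=n)   # schooling
y = (1.0 + 0.08 * s + A @ np.array([0.10, 0.05])
     + rng.normal(scale=0.5, size=n))                    # log wages

def ols(regressors, outcome):
    """Least-squares coefficients for the given design matrix."""
    coef, *_ = np.linalg.lstsq(regressors, outcome, rcond=None)
    return coef

ones = np.ones(n)
short_design = np.column_stack([ones, s])

# Long regression: y on a constant, s, and A.
long_coef = ols(np.column_stack([ones, s, A]), y)
rho_long, gamma_hat = long_coef[1], long_coef[2:]

# Short regression: y on a constant and s only.
rho_short = ols(short_design, y)[1]

# Regressions of each omitted variable on the included one.
delta_As = np.array([ols(short_design, A[:, k])[1] for k in range(A.shape[1])])

ovb_rhs = rho_long + gamma_hat @ delta_As   # "long plus effect of omitted times
                                            #  regression of omitted on included"
cov_ratio = np.cov(y, s)[0, 1] / np.var(s, ddof=1)

print(f"short coefficient:        {rho_short:.6f}")
print(f"long + gamma'delta:       {ovb_rhs:.6f}")
print(f"Cov(Y, s) / V(s):         {cov_ratio:.6f}")
```

The first two printed numbers agree up to floating-point error regardless of whether the long regression has a causal interpretation.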

We can use the OVB formula to get a sense of the likely con- 
sequences of omitting ability for schooling coefficients. Ability 
variables have positive effects on wages, and these variables 
are also likely to be positively correlated with schooling. The 
short regression coefficient may therefore be “too big” relative 
to what we want. On the other hand, as a matter of economic 
theory, the direction of the correlation between schooling and 
ability is not entirely clear. Some omitted variables may be 
negatively correlated with schooling, in which case the short 
regression coefficient may be too small.¹⁵

Table 3.2.1 illustrates these points using data from the 
NLSY. The first three entries in the table show that the 
schooling coefficient decreases from .132 to .114 when fam- 
ily background variables—in this case, parents’ education—as 
well as a few basic demographic characteristics (age, race, 
census region of residence) are included as controls. Further 
control for individual ability, as proxied by the Armed Forces 
Qualification Test (AFQT) score, reduces the schooling coef- 
ficient to .087 (the AFQT is used by the military to select 
soldiers). The OVB formula tells us that these reductions are 
a result of the fact that the additional controls are positively 
correlated with both wages and schooling.¹⁶


¹⁴ Here is the multivariate generalization of OVB: Let $\beta_1^s$ denote the
coefficient vector on a $K_1 \times 1$ vector of variables, $X_{1i}$, in a (short) regression that has
no other variables, and let $\beta_1^l$ denote the coefficient vector on these variables
in a (long) regression that includes a $K_2 \times 1$ vector of additional variables, $X_{2i}$,
with coefficient vector $\beta_2^l$. Then $\beta_1^s = \beta_1^l + E[X_{1i}X_{1i}']^{-1}E[X_{1i}X_{2i}']\beta_2^l$.

¹⁵ As highly educated people, we like to think that ability and schooling are
positively correlated. This is not a foregone conclusion, however: Mick Jagger 
dropped out of the London School of Economics and Bill Gates dropped out 
of Harvard, perhaps because the opportunity cost of schooling for these high- 
ability guys was high (of course, they may also be a couple of very lucky 
college dropouts). 

¹⁶ A large empirical literature investigates the consequences of omitting abil-
ity variables from schooling equations. Key early references include Griliches 
and Mason (1972), Taubman (1976), Griliches (1977), and Chamberlain 
(1978). 

TABLE 3.2.1
Estimates of the returns to education for men in the NLSY

             (1)      (2)        (3)            (4)            (5)
Controls:    None     Age        Col. (2) and   Col. (3) and   Col. (4), with
                      Dummies    Additional     AFQT Score     Occupation
                                 Controls*                     Dummies

             .132     .131       .114           .087           .066
            (.007)   (.007)     (.007)         (.009)         (.010)


Notes: Data are from the National Longitudinal Survey of Youth (1979 cohort, 2002 
survey). The table reports the coefficient on years of schooling in a regression of log 
wages on years of schooling and the indicated controls. Standard errors are shown in 
parentheses. The sample is restricted to men and weighted by NLSY sampling weights. 
The sample size is 2,434. 


* Additional controls are mother’s and father’s years of schooling, and dummy variables 
for race and census region. 


Although simple, the OVB formula is one of the most impor- 
tant things to know about regression. The importance of the 
OVB formula stems from the fact that if you claim an absence 
of omitted variables bias, then typically you’re also saying 
that the regression you’ve got is the one you want. And the 
regression you want usually has a causal interpretation. In 
other words, you’re prepared to lean on the CIA for a causal 
interpretation of the long regression estimates. 

At this point, it’s worth considering when the CIA is most 
likely to give a plausible basis for empirical work. The best- 
case scenario is random assignment of $s_i$, conditional on $X_i$,
in some sort of (possibly natural) experiment. An example is 
the study of a mandatory retraining program for unemployed 
workers by Black et al. (2003). The authors of this study 
were interested in whether the retraining program succeeded 
in raising earnings later on. They exploited the fact that eli- 
gibility for the training program they studied was determined 
on the basis of personal characteristics and past unemploy- 
ment and job histories. Workers were divided into groups 
on the basis of these characteristics. While some of these 
groups of workers were ineligible for training, workers in other 
groups were required to take training if they did not take a 
job. When some of the mandatory training groups contained 
more workers than training slots, training opportunities were
distributed by lottery. Hence, training requirements were ran- 
domly assigned conditional on the covariates used to assign 
workers to groups. A regression on a dummy for training 
plus the personal characteristics, past unemployment vari- 
ables, and job history variables used to classify workers seems 
very likely to provide reliable estimates of the causal effect of 
training.¹⁷

In the schooling context, there is usually no lottery that 
directly determines whether someone will go to college or 
finish high school.¹⁸ Still, we might imagine subjecting individuals
of similar ability and from similar family backgrounds to
an experiment that encourages school attendance. The Educa- 
tion Maintenance Allowance, which pays British high school 
students in certain areas to attend school, is one such policy 
experiment (Dearden et al. 2003). 

A second scenario that favors the CIA leans on detailed institutional
knowledge regarding the process that determines $s_i$.
An example is the Angrist (1998) study of the effect of vol- 
untary military service on the later earnings of soldiers. This 
research asks whether men who volunteered for service in the 
U.S. armed forces were economically better off in the long 
run. Since voluntary military service is not randomly assigned, 
we can never know for sure. Angrist therefore used matching 
and regression techniques to control for observed differences 
between veterans and nonveterans who applied to the all- 
volunteer forces between 1979 and 1982. The motivation 
for a control strategy in this case is the fact that the military 
screens soldier applicants primarily on the basis of observable 
covariates like age, schooling, and test scores. 

The CIA in Angrist (1998) amounts to the claim that after 
conditioning on all these observed characteristics, veterans and 
nonveterans are comparable. This assumption seems worth 
entertaining since, conditional on $X_i$, variation in veteran
status in the Angrist (1998) study comes solely from the fact
that some qualified applicants fail to enlist at the last minute.
Of course, the considerations that lead a qualified applicant
to “drop out” of the enlistment process could be related to
earnings potential, so the CIA is clearly not guaranteed even
in this case.

¹⁷ This program appears to raise earnings, primarily because workers
offered training went back to work more quickly.

¹⁸ Lotteries have been used to distribute private school tuition subsidies; see
Angrist et al. (2002).


3.2.3 Bad Control 


We’ve made the point that control for covariates can increase 
the likelihood that regression estimates have a causal interpre- 
tation. But more control is not always better. Some variables 
are bad controls and should not be included in a regression 
model even when their inclusion might be expected to change 
the short regression coefficients. Bad controls are variables that 
are themselves outcome variables in the notional experiment 
at hand. That is, bad controls might just as well be dependent 
variables too. Good controls are variables that we can think 
of as having been fixed at the time the regressor of interest was 
determined. 

The essence of the bad control problem is a version of selec- 
tion bias, albeit somewhat more subtle than the selection bias 
discussed in chapter 2 and section 3.2.1. To illustrate, suppose 
we are interested in the effects of a college degree on earnings 
and that people can work in one of two occupations, white 
collar and blue collar. A college degree clearly opens the door 
to higher-paying white collar jobs. Should occupation there- 
fore be seen as an omitted variable in a regression of wages 
on schooling? After all, occupation is highly correlated with 
both education and pay. Perhaps it’s best to look at the effect 
of college on wages for those within an occupation, say white 
collar only. The problem with this argument is that once we 
acknowledge the fact that college affects occupation, compar- 
isons of wages by college degree status within an occupation 
are no longer apples-to-apples comparisons, even if college 
degree completion is randomly assigned. 

Here is a formal illustration of the bad control problem in
the college/occupation example.¹⁹ Let $W_i$ be a dummy variable
that denotes white collar workers and let $Y_i$ denote earnings.
The realization of these variables is determined by college
graduation status and potential outcomes that are indexed
against $C_i$. We have

\begin{align*}
Y_i &= C_i Y_{1i} + (1 - C_i) Y_{0i} \\
W_i &= C_i W_{1i} + (1 - C_i) W_{0i},
\end{align*}

where $C_i = 1$ for college graduates and is zero otherwise,
$\{Y_{1i}, Y_{0i}\}$ denotes potential earnings, and $\{W_{1i}, W_{0i}\}$ denotes
potential white collar status. We assume that $C_i$ is randomly
assigned, so it is independent of all potential outcomes. We
have no trouble estimating the causal effect of $C_i$ on either $Y_i$
or $W_i$ since independence gives us

\begin{align*}
E[Y_i | C_i = 1] - E[Y_i | C_i = 0] &= E[Y_{1i} - Y_{0i}], \\
E[W_i | C_i = 1] - E[W_i | C_i = 0] &= E[W_{1i} - W_{0i}].
\end{align*}

In practice, we can estimate these average treatment effects by
regressing $Y_i$ and $W_i$ on $C_i$.

¹⁹ The same problem arises in conditional-on-positive comparisons,
discussed in detail in section 3.4.2.

Bad control means that a comparison of earnings conditional
on $W_i$ does not have a causal interpretation. Consider
the difference in mean earnings between college graduates and
others, conditional on working at a white collar job. We can
compute this in a regression model that includes $W_i$ or by
regressing $Y_i$ on $C_i$ in the sample where $W_i = 1$. The estimand
in the latter case is the difference in means with $C_i$ switched
off and on, conditional on $W_i = 1$:

\begin{align}
& E[Y_i | W_i = 1, C_i = 1] - E[Y_i | W_i = 1, C_i = 0] \notag \\
& \quad = E[Y_{1i} | W_{1i} = 1, C_i = 1] - E[Y_{0i} | W_{0i} = 1, C_i = 0]. \tag{3.2.12}
\end{align}

By the joint independence of $\{Y_{1i}, W_{1i}, Y_{0i}, W_{0i}\}$ and $C_i$, we have

\begin{align*}
& E[Y_{1i} | W_{1i} = 1, C_i = 1] - E[Y_{0i} | W_{0i} = 1, C_i = 0] \\
& \quad = E[Y_{1i} | W_{1i} = 1] - E[Y_{0i} | W_{0i} = 1].
\end{align*}

This expression illustrates the apples-to-oranges nature of the
bad control problem:

\begin{align*}
& E[Y_{1i} | W_{1i} = 1] - E[Y_{0i} | W_{0i} = 1] \\
& \quad = \underbrace{E[Y_{1i} - Y_{0i} | W_{1i} = 1]}_{\text{Causal effect}}
  + \underbrace{\{E[Y_{0i} | W_{1i} = 1] - E[Y_{0i} | W_{0i} = 1]\}}_{\text{Selection bias}}.
\end{align*}

In other words, the difference in wages between those with
and those without a college degree conditional on working in
a white collar job equals the causal effect of college on those
with $W_{1i} = 1$ (people who work at a white collar job when they
have a college degree) and a selection bias term that reflects
the fact that college changes the composition of the pool of
white collar workers.

The selection bias in this context can be positive or nega- 
tive, depending on the relation between occupational choice, 
college attendance, and potential earnings. The main point is 
that even if $Y_{1i} = Y_{0i}$, so that there is no causal effect of college
on wages, the conditional comparison in (3.2.12) will not
tell us this (the regression of $Y_i$ on $W_i$ and $C_i$ has exactly the
same problem). It is also incorrect to say that the conditional
comparison captures the part of the effect of college that is
“not explained by occupation.” In fact, the conditional comparison
does not tell us much that is useful without a more
elaborate model of the links between college, occupation, and
earnings.²⁰
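A small simulation can make the bad control problem concrete. The sketch below is our own illustration, not an analysis from any study cited here: the function name `simulate_bad_control`, the ability thresholds for white collar work, and the earnings equation are all assumptions chosen so that college has no causal effect on earnings ($Y_{1i} = Y_{0i}$) but makes white collar jobs easier to get.

```python
import numpy as np

def simulate_bad_control(n=200_000, seed=0):
    """Simulate randomly assigned college with zero causal effect on earnings."""
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=n)
    c = rng.integers(0, 2, size=n)               # college dummy, randomly assigned

    # Potential earnings: identical with and without college (assumption).
    y0 = 10.0 + ability + rng.normal(scale=0.5, size=n)
    y1 = y0.copy()

    # Potential white collar status: college lowers the ability bar (assumption).
    w0 = (ability > 1.0).astype(int)
    w1 = (ability > -1.0).astype(int)

    # Observed outcomes follow the switching equations in the text.
    y = np.where(c == 1, y1, y0)
    w = np.where(c == 1, w1, w0)

    white = w == 1
    return {
        # Difference in means by college status: estimates E[Y1 - Y0].
        "unconditional": y[c == 1].mean() - y[c == 0].mean(),
        # Same comparison among white collar workers only, as in (3.2.12).
        "conditional": y[white & (c == 1)].mean() - y[white & (c == 0)].mean(),
        # The two pieces of the decomposition, computable only in a simulation.
        "causal_effect": (y1 - y0)[w1 == 1].mean(),
        "selection_bias": y0[w1 == 1].mean() - y0[w0 == 1].mean(),
    }

if __name__ == "__main__":
    for name, value in simulate_bad_control().items():
        print(f"{name:>15s}: {value: .3f}")
```

Under these assumptions the unconditional comparison is close to zero, as it should be, while the comparison within white collar jobs is clearly negative. The gap is entirely selection bias: noncollege workers who land white collar jobs are drawn from the top of the ability distribution.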

As an empirical illustration, we see that the addition of 
two-digit occupation dummies indeed reduces the schooling 
coefficient in the NLSY models reported in table 3.2.1, in 
this case from .087 to .066. However, it’s hard to say what 
we should make of this decline. The change in schooling 
coefficients when we add occupation dummies may simply 
be an artifact of selection bias. So we would do better to 
control only for variables that are not themselves caused by 
education. 

²⁰ In this example, selection bias is probably negative, that is, $E[Y_{0i}|W_{1i} = 1]
< E[Y_{0i}|W_{0i} = 1]$. It seems reasonable to think that any college graduate can
get a white collar job, so $E[Y_{0i}|W_{1i} = 1]$ is not too far from $E[Y_{0i}]$. But someone
who gets a white collar job without benefit of a college degree (i.e., $W_{0i} = 1$)
is probably special, that is, has a better than average $Y_{0i}$.

A second version of the bad control scenario involves proxy
control, that is, the inclusion of variables that might partially
control for omitted factors but are themselves affected by the
variable of interest. A simple version of the proxy control story
goes like this: Suppose you are interested in a long regression,
similar to (3.2.10),

\[
Y_i = \alpha + \rho s_i + \gamma a_i + e_i, \tag{3.2.13}
\]

where for the purposes of this discussion we’ve replaced the
vector of controls $A_i$ with a scalar ability measure $a_i$. Think
of this as an IQ score that measures innate ability in eighth
grade, before any relevant schooling choices are made (assuming
everyone completes eighth grade). The error term in this
equation satisfies $E[s_i e_i] = E[a_i e_i] = 0$ by definition. Since $a_i$
is measured before $s_i$ is determined, it is a good control.
Equation (3.2.13) is the regression of interest, but unfortunately,
data on $a_i$ are unavailable. However, you have a second
ability measure collected later, after schooling is completed
(say, the score on a test used to screen job applicants). Call
this variable late ability, $a_{li}$. In general, schooling increases
late ability relative to innate ability. To be specific, suppose

\[
a_{li} = \pi_0 + \pi_1 s_i + \pi_2 a_i. \tag{3.2.14}
\]

By this, we mean to say that both schooling and innate ability
increase late or measured ability. There is almost certainly
some randomness in measured ability as well, but we can make
our point more simply via the deterministic link, (3.2.14).
You’re worried about OVB in the regression of $Y_i$ on $s_i$
alone, so you propose to regress $Y_i$ on $s_i$ and late ability, $a_{li}$,
since the desired control, $a_i$, is unavailable. Using (3.2.14) to
substitute for $a_i$ in (3.2.13), the regression on $s_i$ and $a_{li}$ is

\[
Y_i = \left(\alpha - \gamma\frac{\pi_0}{\pi_2}\right)
    + \left(\rho - \gamma\frac{\pi_1}{\pi_2}\right) s_i
    + \frac{\gamma}{\pi_2} a_{li} + e_i. \tag{3.2.15}
\]

In this scenario, $\gamma$, $\pi_1$, and $\pi_2$ are all positive, so $\rho - \gamma\frac{\pi_1}{\pi_2}$ is
too small unless $\pi_1$ turns out to be zero. In other words, use
of a proxy control that is increased by the variable of interest
generates a coefficient below the desired effect. But it is
important to note that $\pi_1$ can be investigated to some extent:
if the regression of $a_{li}$ on $s_i$ is zero, you might feel better about
assuming that $\pi_1$ is zero in (3.2.14).

There is an interesting ambiguity in the proxy control story
that is not present in the first bad control story. Control for 
outcome variables is simply misguided; you do not want to 
control for occupation in a schooling regression if the regres- 
sion is to have a causal interpretation. In the proxy control 
scenario, however, your intentions are good. And while proxy 
control does not generate the regression coefficient of interest, 
it may be an improvement on no control at all. Recall that the 
motivation for proxy control is equation (3.2.13). In terms of 
the parameters in this model, the OVB formula tells us that 
a regression on $s_i$ with no controls generates a coefficient of
$\rho + \gamma\delta_{as}$, where $\delta_{as}$ is the slope coefficient from a regression of
$a_i$ on $s_i$. The schooling coefficient in (3.2.15) might be closer
to $\rho$ than the coefficient you estimate with no control at all.
Moreover, assuming $\delta_{as}$ is positive, you can safely say that the
causal effect of interest lies between these two.
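The bracketing result is easy to check by simulation. The sketch below is illustrative only: the helper `ols`, the function `proxy_control_demo`, and every parameter value ($\rho = .10$, $\gamma = .50$, $\pi_1 = .10$, $\pi_2 = 1$, and the schooling equation) are assumptions we made up for the exercise. With these values the proxy-control coefficient should be near $\rho - \gamma\pi_1/\pi_2 = .05$ and the no-control coefficient near $\rho + \gamma\delta_{as} = .30$.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients for y on a constant plus the given regressors."""
    design = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def proxy_control_demo(n=200_000, seed=0, rho=0.10, gamma=0.50, pi1=0.10, pi2=1.0):
    """Compare long, short, and proxy-control schooling coefficients."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)                        # innate ability, unobserved in practice
    s = 12.0 + 2.0 * a + rng.normal(size=n)       # schooling rises with ability (assumption)
    a_late = 1.0 + pi1 * s + pi2 * a              # late ability as in (3.2.14), with pi0 = 1
    y = 5.0 + rho * s + gamma * a + rng.normal(size=n)   # long regression (3.2.13)
    return {
        "long": ols(y, s, a)[1],                  # infeasible regression of interest
        "short": ols(y, s)[1],                    # no control: rho + gamma * delta_as
        "proxy": ols(y, s, a_late)[1],            # proxy control: rho - gamma * pi1 / pi2
        "proxy_theory": rho - gamma * pi1 / pi2,
    }

if __name__ == "__main__":
    for name, value in proxy_control_demo().items():
        print(f"{name:>12s}: {value:.3f}")
```

In this design the true effect of .10 sits between the proxy-control and no-control estimates, as the text says it should whenever $\delta_{as}$ is positive.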

One moral of both the bad control and the proxy control 
stories is that when thinking about controls, timing mat- 
ters. Variables measured before the variable of interest was 
determined are generally good controls. In particular, because 
these variables were determined before the variable of inter- 
est, they cannot themselves be outcomes in the causal nexus. 
Often, however, the timing is uncertain or unknown. In such 
cases, clear reasoning about causal channels requires explicit 
assumptions about what happened first, or the assertion that 
none of the control variables are themselves caused by the 
regressor of interest.²¹

²¹ Griliches and Mason (1972) is a seminal exploration of the use of early
and late ability controls in schooling equations. See also Chamberlain (1977,
1978) for closely related studies. Rosenbaum (1984) offers an alternative
discussion of the proxy control idea using very different notation, outside a
regression framework.


3.3 Heterogeneity and Nonlinearity


As we saw in the previous section, a linear causal model in
combination with the CIA leads to a linear CEF with a causal
interpretation. Assuming the CEF is linear, the population
regression function is it. In practice, however, the assumption
of a linear CEF is not really necessary for a causal interpretation
of regression. For one thing, as discussed in section 3.1.2,
we can think of the regression of $Y_i$ on $X_i$ and $s_i$ as providing
the best linear approximation to the underlying CEF, regardless
of its shape. Therefore, if the CEF is causal, the fact that
regression approximates it gives regression coefficients a causal
flavor. This claim is a little vague, however, and the nature
of the link between regression and the CEF is worth exploring
further. This exploration leads us to an understanding of
regression as a computationally attractive matching estimator.


3.3.1 Regression Meets Matching 


The past decade or two has seen increasing interest in match- 
ing as an empirical tool. Matching as a strategy to control for 
covariates is typically motivated by the CIA, as with causal 
regression in the previous section. For example, Angrist (1998) 
used matching to estimate the effects of voluntary military ser- 
vice on the later earnings of soldiers. These matching estimates 
have a causal interpretation assuming that, conditional on the 
individual characteristics the military uses to select soldiers 
(age, schooling, test scores), veteran status is independent 
of potential earnings. Matching estimators are appealingly 
simple: at bottom, matching amounts to covariate-specific 
treatment-control comparisons, weighted together to produce 
a single overall average treatment effect. 

An attractive feature of matching strategies is that they are 
typically accompanied by an explicit statement of the con- 
ditional independence assumption required to give matching 
estimates a causal interpretation. At the same time, we have 
just seen that the causal interpretation of a regression coeffi- 
cient is based on exactly the same assumption. In other words, 
matching and regression are both control strategies. Since 
the core assumption underlying causal inference is the same 
for the two strategies, it’s worth asking whether or to what 
extent matching really differs from regression. Our view is that 
regression can be motivated as a particular sort of weighted 
matching estimator, and therefore the differences between
regression and matching estimates are unlikely to be of major
empirical importance. 

To flesh out this idea, it helps to look more deeply into the
mathematical structure of the matching and regression estimands,
that is, the population quantities that these methods
attempt to estimate. For regression, of course, the estimand
is a vector of population regression coefficients. The matching
estimand is typically a weighted average of contrasts or
comparisons across cells defined by covariates. This is easiest
to see in the case of discrete covariates, as in the military
service example, and for a discrete regressor such as veteran
status, which we denote here by the dummy $D_i$. Since treatment
takes on only two values, we can use the notation $Y_{1i}$
and $Y_{0i}$ to denote potential outcomes. A parameter of primary
interest in this context is the average effect of treatment
on the treated, $E[Y_{1i} - Y_{0i}|D_i = 1]$. This tells us the difference
between the average earnings of soldiers, $E[Y_{1i}|D_i = 1]$, an
observable quantity, and the counterfactual average earnings
they would have obtained if they had not served, $E[Y_{0i}|D_i = 1]$.
Simple comparisons of earnings by veteran status give a biased
measure of the effect of treatment on the treated unless $D_i$ is
independent of $Y_{0i}$. Specifically,

\begin{align*}
& E[Y_i|D_i = 1] - E[Y_i|D_i = 0] \\
& \quad = E[Y_{1i} - Y_{0i}|D_i = 1] + \{E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]\}.
\end{align*}

In other words, the observed earnings difference by veteran
status equals the average effect of treatment on the treated
plus selection bias. This parallels the discussion of selection
bias in chapter 2.

The CIA in this context says that

\[
\{Y_{0i}, Y_{1i}\} \perp\!\!\!\perp D_i \mid X_i.
\]

Given the CIA, selection bias disappears after conditioning on
$X_i$, so the effect of treatment on the treated can be constructed
by iterating expectations over $X_i$:

\begin{align*}
\delta_{TOT} &\equiv E[Y_{1i} - Y_{0i}|D_i = 1] \\
&= E\{E[Y_{1i} - Y_{0i}|X_i, D_i = 1]|D_i = 1\} \\
&= E\{E[Y_{1i}|X_i, D_i = 1] - E[Y_{0i}|X_i, D_i = 1]|D_i = 1\}.
\end{align*}

Of course, $E[Y_{0i}|X_i, D_i = 1]$ is counterfactual. By virtue of the
CIA, however,

\[
E[Y_{0i}|X_i, D_i = 0] = E[Y_{0i}|X_i, D_i = 1].
\]

Therefore,

\begin{align}
\delta_{TOT} &= E\{E[Y_{1i}|X_i, D_i = 1] - E[Y_{0i}|X_i, D_i = 0]|D_i = 1\} \notag \\
&= E[\delta_X|D_i = 1], \tag{3.3.1}
\end{align}

where

\[
\delta_X \equiv E[Y_i|X_i, D_i = 1] - E[Y_i|X_i, D_i = 0],
\]

is the difference in mean earnings by veteran status at each
value of $X_i$. At a particular value, say $X_i = x$, we write $\delta_x$.

The matching estimator in Angrist (1998) uses the fact that
$X_i$ is discrete to construct the sample analog of the right-hand
side of (3.3.1). In the discrete case, the matching estimand can
be written

\[
E[Y_{1i} - Y_{0i}|D_i = 1] = \sum_x \delta_x P(X_i = x|D_i = 1), \tag{3.3.2}
\]

where $P(X_i = x|D_i = 1)$ is the probability mass function for
$X_i$ given $D_i = 1$.²² In this case, $X_i$ takes on values determined
by all possible combinations of year of birth, test score
group, year of application to the military, and educational
attainment at the time of application. The test score in this
case is from the AFQT, used by the military to categorize the
mental abilities of applicants (we included this as a control
in the schooling regression discussed in section 3.2.2). The
Angrist (1998) matching estimator replaces $\delta_X$ by the sample
veteran-nonveteran earnings difference for each combination
of covariates and then combines these in a weighted average
using the empirical distribution of covariates among veterans.


²² This matching estimator is discussed by Rubin (1977) and used by
Card and Sullivan (1988) to estimate the effect of subsidized training on
employment.

Note also that we can just as easily construct the unconditional
average treatment effect,

\begin{align}
\delta_{ATE} &= E\{E[Y_{1i}|X_i, D_i = 1] - E[Y_{0i}|X_i, D_i = 0]\} \notag \\
&= \sum_x \delta_x P(X_i = x) = E[Y_{1i} - Y_{0i}]. \tag{3.3.3}
\end{align}

This is the expectation of $\delta_X$ using the marginal distribution of
$X_i$ instead of the distribution among the treated. $\delta_{TOT}$ tells us
how much the typical soldier gained or lost as a consequence
of military service, while $\delta_{ATE}$ tells us how much the typical
applicant to the military gained or lost (since the Angrist, 1998,
population consists of applicants).
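The sample analogs of (3.3.2) and (3.3.3) take only a few lines to compute when covariates are discrete. The sketch below is a minimal illustration, not the code behind Angrist (1998): the function name `matching_estimates` and the twelve-observation toy data set are invented, and we assume that cells lacking either treated or control observations are simply skipped.

```python
import numpy as np

def matching_estimates(y, d, x):
    """Sample analogs of the TOT (3.3.2) and ATE (3.3.3) with discrete cells x."""
    y, d, x = np.asarray(y, dtype=float), np.asarray(d), np.asarray(x)
    tot_num = tot_den = ate_num = ate_den = 0.0
    for cell in np.unique(x):
        in_cell = x == cell
        treated = in_cell & (d == 1)
        control = in_cell & (d == 0)
        if not treated.any() or not control.any():
            continue                                   # delta_x is undefined here (assumption: skip)
        delta_x = y[treated].mean() - y[control].mean()
        tot_num += delta_x * treated.sum()             # weight by number treated in the cell
        tot_den += treated.sum()
        ate_num += delta_x * in_cell.sum()             # weight by cell size
        ate_den += in_cell.sum()
    return tot_num / tot_den, ate_num / ate_den

# Toy data (hypothetical): cell 0 has delta = 3, cell 1 has delta = 1.
x = np.array([0] * 6 + [1] * 6)
d = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0])
y = np.array([10, 12, 8, 8, 8, 8, 6, 6, 6, 6, 5, 5], dtype=float)

if __name__ == "__main__":
    tot, ate = matching_estimates(y, d, x)
    naive = y[d == 1].mean() - y[d == 0].mean()
    print(f"naive difference in means: {naive:.3f}")
    print(f"matching TOT: {tot:.3f}   matching ATE: {ate:.3f}")
```

In the toy data, treatment is more common in the low-effect cell, so the effect on the treated is smaller than the average effect over all observations.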

The U.S. military tends to be fairly picky about its soldiers, 
especially after downsizing at the end of the cold war. For the 
most part, the military now takes only high school graduates 
with test scores in the upper half of the test score distribution. 
Applicant screening therefore generates positive selection bias 
in naive comparisons of veteran and nonveteran earnings. This 
can be seen in table 3.3.1, which reports differences-in-means, 
matching, and regression estimates of the effect of voluntary 
military service on the 1988-91 Social Security-taxable earn- 
ings of men who applied to join the military between 1979 and 
1982. The matching estimates were constructed from the sam- 
ple analog of (3.3.2). Although white veterans earned $1,233 
more than white nonveterans, this estimated veteran effect 
becomes negative once differences in covariates are matched 
away. Similarly, while nonwhite veterans earned $2,449 more 
than nonwhite nonveterans, controlling for covariates reduces 
this difference to $840. 

Table 3.3.1 also shows regression estimates of the effect
of voluntary military service, controlling for the same set of
covariates that were used to construct the matching estimates.
These are estimates of $\delta_R$ in the equation

\[
Y_i = \sum_x d_{ix}\alpha_x + \delta_R D_i + \varepsilon_i, \tag{3.3.4}
\]

where $d_{ix} = 1[X_i = x]$ is a dummy variable that indicates
$X_i = x$, $\alpha_x$ is a regression effect for $X_i = x$, and $\delta_R$ is the
regression estimand. Note that this regression model allows
a separate parameter for every value taken on by the covariates.
This model can therefore be said to be saturated-in-$X_i$,
since it includes a parameter for every value of $X_i$. It is not fully
saturated, however, because there is a single additive effect for
$D_i$ with no $D_i \cdot X_i$ interactions.

TABLE 3.3.1
Uncontrolled, matching, and regression estimates of the effects of voluntary
military service on earnings

             Average       Differences
             Earnings      in Means by                                Regression
             in 1988-91    Veteran Status   Matching     Regression   Minus
                                            Estimates    Estimates    Matching
Race         (1)           (2)              (3)          (4)          (5)
Whites       14,537        1,233.4          -197.2       -88.8        108.4
                           (60.3)           (70.5)       (62.5)       (28.5)
Nonwhites    11,664        2,449.1          839.7        1,074.4      234.7
                           (47.4)           (62.7)       (50.7)       (32.5)

Notes: Adapted from Angrist (1998, tables II and V). Standard errors are reported
in parentheses. The table shows estimates of the effect of voluntary military service on
the 1988-91 Social Security-taxable earnings of men who applied to enter the armed
forces between 1979 and 1982. The matching and regression estimates control for
applicants’ year of birth, education at the time of application, and AFQT score. There
are 128,968 whites and 175,262 nonwhites in the sample.

Despite the fact that the matching and regression estimates 
control for the same variables, the regression estimates in 
table 3.3.1 are somewhat larger for nonwhites and less nega- 
tive for whites. In fact, the differences between the matching 
and regression results are statistically significant. At the same 
time, the two estimation strategies present a broadly similar 
picture of the effects of military service. The reason the regres- 
sion and matching estimates are similar is that regression, too, 
can be seen as a sort of matching estimator: the regression estimand
differs from the matching estimands only in the weights
used to combine the covariate-specific effects, $\delta_X$, into a single
average effect. In particular, while matching uses the distribution
of covariates among the treated to weight covariate-specific
estimates into an estimate of the effect of treatment on
the treated, regression produces a variance-weighted average
of these effects.

To see this, start by using the regression anatomy formula
to write the coefficient on $D_i$ in the regression of $Y_i$ on $X_i$ and
$D_i$ as

\begin{align}
\delta_R &= \frac{Cov(Y_i, \tilde{D}_i)}{V(\tilde{D}_i)} \tag{3.3.5} \\
&= \frac{E[(D_i - E[D_i|X_i])Y_i]}{E[(D_i - E[D_i|X_i])^2]} \notag \\
&= \frac{E\{(D_i - E[D_i|X_i])E[Y_i|D_i, X_i]\}}{E[(D_i - E[D_i|X_i])^2]}. \tag{3.3.6}
\end{align}

The second equality in this set of expressions uses the fact that
saturating the model in $X_i$ means $E[D_i|X_i]$ is linear. Hence, $\tilde{D}_i$,
which is defined as the residual from a regression of $D_i$ on $X_i$,
is the difference between $D_i$ and $E[D_i|X_i]$. The third equality
uses the fact that the regression of $Y_i$ on $D_i$ and $X_i$ is the same
as the regression of $Y_i$ on $E[Y_i|D_i, X_i]$ (this we know from the
regression CEF theorem, 3.1.6).

To simplify further, we expand the CEF, $E[Y_i|D_i, X_i]$, to get

\[
E[Y_i|D_i, X_i] = E[Y_i|D_i = 0, X_i] + \delta_X D_i,
\]

and then substitute for $E[Y_i|D_i, X_i]$ in the numerator of (3.3.6).
This gives

\begin{align*}
E\{(D_i - E[D_i|X_i])E[Y_i|D_i, X_i]\}
&= E\{(D_i - E[D_i|X_i])E[Y_i|D_i = 0, X_i]\} \\
&\quad + E\{(D_i - E[D_i|X_i])D_i\delta_X\}.
\end{align*}

The first term on the right-hand side is zero because $E[Y_i|D_i = 0, X_i]$
is a function of $X_i$ only and is therefore uncorrelated
with $(D_i - E[D_i|X_i])$. Similarly, the second term simplifies to

\[
E\{(D_i - E[D_i|X_i])D_i\delta_X\} = E\{(D_i - E[D_i|X_i])^2\delta_X\}.
\]

At this point, we’ve shown

\begin{align}
\delta_R &= \frac{E[(D_i - E[D_i|X_i])^2\delta_X]}{E[(D_i - E[D_i|X_i])^2]} \notag \\
&= \frac{E\{E[(D_i - E[D_i|X_i])^2|X_i]\delta_X\}}{E\{E[(D_i - E[D_i|X_i])^2|X_i]\}}
 = \frac{E[\sigma_D^2(X_i)\delta_X]}{E[\sigma_D^2(X_i)]}, \tag{3.3.7}
\end{align}

where

\[
\sigma_D^2(X_i) \equiv E[(D_i - E[D_i|X_i])^2|X_i]
\]

is the conditional variance of $D_i$ given $X_i$. This establishes that
the regression model, (3.3.4), produces a treatment-variance
weighted average of $\delta_X$.

Because the regressor of interest, $D_i$, is a dummy variable,
one last step can be taken. In this case, $\sigma_D^2(X_i) = P(D_i = 1|X_i)(1 - P(D_i = 1|X_i))$, so

\[
\delta_R = \frac{\sum_x \delta_x [P(D_i = 1|X_i = x)(1 - P(D_i = 1|X_i = x))]P(X_i = x)}
                {\sum_x [P(D_i = 1|X_i = x)(1 - P(D_i = 1|X_i = x))]P(X_i = x)}.
\]

This shows that the regression estimand weights each
covariate-specific treatment effect by $[P(D_i = 1|X_i = x)(1 -
P(D_i = 1|X_i = x))]P(X_i = x)$. In contrast, the matching estimand
for the effect of treatment on the treated can be
written

\begin{align*}
E[Y_{1i} - Y_{0i}|D_i = 1] &= \sum_x \delta_x P(X_i = x|D_i = 1) \\
&= \frac{\sum_x \delta_x P(D_i = 1|X_i = x)P(X_i = x)}{\sum_x P(D_i = 1|X_i = x)P(X_i = x)},
\end{align*}

using the fact that

\[
P(X_i = x|D_i = 1) = \frac{P(D_i = 1|X_i = x) \cdot P(X_i = x)}{P(D_i = 1)}.
\]

So the weights used to construct $E[Y_{1i} - Y_{0i}|D_i = 1]$ are proportional
to the probability of treatment at each value of the
covariates. The regression and matching weighting schemes
therefore differ unless treatment is independent of covariates.
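The variance-weighting result holds exactly in any sample, which makes it easy to verify numerically. The sketch below is our own check on invented toy data; the function names `saturated_ols_coefficient` and `weighted_cell_effects` are hypothetical. It compares the coefficient on $D_i$ from a regression saturated in $X_i$ with the average of cell-specific effects weighted by $\hat{p}_x(1 - \hat{p}_x)n_x$, and reports the treated-weighted (matching) average for contrast.

```python
import numpy as np

# Toy data (hypothetical): cell 0 has delta = 3 with 1 of 6 treated,
# cell 1 has delta = 1 with 3 of 6 treated.
x = np.array([0] * 6 + [1] * 6)
d = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
y = np.array([11, 8, 8, 8, 8, 8, 6, 6, 6, 5, 5, 5], dtype=float)

def saturated_ols_coefficient(y, d, x):
    """Coefficient on d from OLS of y on a full set of cell dummies and d, as in (3.3.4)."""
    cells = np.unique(x)
    dummies = (x[:, None] == cells[None, :]).astype(float)   # one dummy per cell, no constant
    design = np.column_stack([dummies, d.astype(float)])
    return np.linalg.lstsq(design, y, rcond=None)[0][-1]

def weighted_cell_effects(y, d, x):
    """Variance-weighted (regression) and treated-weighted (matching) averages of delta_x."""
    reg_num = reg_den = tot_num = tot_den = 0.0
    for cell in np.unique(x):
        in_cell = x == cell
        n_x = in_cell.sum()
        p_x = d[in_cell].mean()                  # share treated in the cell
        if p_x == 0.0 or p_x == 1.0:
            continue                             # no treated-control contrast in this cell
        delta_x = y[in_cell & (d == 1)].mean() - y[in_cell & (d == 0)].mean()
        reg_num += delta_x * p_x * (1.0 - p_x) * n_x
        reg_den += p_x * (1.0 - p_x) * n_x
        tot_num += delta_x * p_x * n_x
        tot_den += p_x * n_x
    return reg_num / reg_den, tot_num / tot_den

if __name__ == "__main__":
    delta_r, delta_tot = weighted_cell_effects(y, d, x)
    print(f"saturated OLS: {saturated_ols_coefficient(y, d, x):.4f}")
    print(f"variance-weighted: {delta_r:.4f}   treated-weighted: {delta_tot:.4f}")
```

Here the regression coefficient (12/7) is pulled toward the effect in the evenly split cell, while the matching estimate of the effect on the treated (1.5) leans further toward that cell because most of the treated are found there.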

An important point coming out of this derivation is that 
the treatment-on-the-treated estimand puts the most weight 
on covariate cells containing those who are most likely to be 
treated. In contrast, regression puts the most weight on covari- 
ate cells where the conditional variance of treatment status 
is largest. As a rule, treatment variance is maximized when 
$P(D_i = 1|X_i = x) = \frac{1}{2}$, in other words, for cells where there are
equal numbers of treated and control observations. The difference
in weighting schemes is of little importance if $\delta_x$ does not
vary across cells (though weighting still affects the statistical 
efficiency of estimators). In this example, however, men who 
were most likely to serve in the military appear to benefit least 
from their service. This is probably because those most likely 
to serve were most qualified and therefore also had the highest 
civilian earnings potential. This fact leads matching estimates 
of the effect of military service to be smaller than regression 
estimates based on the same vector of control variables.²³

Also important is the fact that neither the regression nor 
the covariate-matching estimands give any weight to covariate 
cells that do not contain both treated and control observations. 
Consider a value of $X_i$, say $x^*$, where either no one is treated
or everyone is treated. Then, $\delta_{x^*}$ is undefined, while the regression
weights, $[P(D_i = 1|X_i = x^*)(1 - P(D_i = 1|X_i = x^*))]$, are
zero. In the language of the econometric literature on matching,
with saturated control for covariates both the regression
and matching estimands impose common support, that is, they
are limited to covariate values where both treated and control
observations are found.²⁴


²³ It’s no surprise that regression gives the most weight to cells where $P(D_i =
1|X_i = x) = \frac{1}{2}$ since regression is efficient for a homoskedastic constant effects
linear model. We should expect an efficient estimator to give the most weight
to cells where the common treatment effect is estimated most precisely. With
homoskedastic residuals, the most precise treatment effects come from cells
where the probability of treatment equals $\frac{1}{2}$.

²⁴ The support of a random variable is the set of realizations that occur
with positive probability. See Heckman, Ichimura, Smith, and Todd (1998)
and Smith and Todd (2001) for a discussion of common support in matching.

The step from estimand to estimator is a little more complicated.
In practice, both regression and matching estimators
are implemented using modeling assumptions that implicitly 
involve a certain amount of extrapolation across cells. For 
example, matching estimators often combine covariate cells 
with few observations. This violates common support if the 
cells being combined do not all have both treated and non- 
treated observations. Regression models that are not saturated 
in X; may also violate common support, since covariate cells 
without both treated and control observations can end up 
contributing to the estimates by extrapolation. Here, too, 
however, we see a symmetry between the matching and regres- 
sion strategies: they are in the same class, in principle, and 
require the same sort of compromises in practice.²⁵


Even More on Regression and Matching: Ordered 
and Continuous Treatments* 


Does the quasi-matching interpretation of regression outlined 
above for a binary treatment variable apply to models with 
ordered and continuous treatments? The long answer is fairly 
technical and may be more than you want to know. The short 
answer is, to one degree or another, yes. 

As we’ve already discussed, the population OLS slope vector 
always provides the MMSE linear approximation to the CEF. 
This, of course, works for ordered and continuous regres- 
sors as well as for binary. A related property is the fact that 
regression coefficients have an “average derivative” interpre- 
tation. In multivariate regression models, this interpretation 
is unfortunately complicated by the fact that the OLS slope 
vector is a matrix-weighted average of the gradient of the
CEF. Matrix-weighted averages are difficult to interpret except
in special cases (see Chamberlain and Leamer, 1976). An
important special case when the average derivative property
is relatively straightforward is in regression models for an
ordered or continuous treatment with a saturated model for
covariates. To avoid lengthy derivations, we simply explain
the formulas. A derivation is sketched in the appendix to this
chapter. For additional details, see the appendix to Angrist
and Krueger (1999).

²⁵ Matching problems involving finely distributed X-variables are often
solved by aggregating values to make coarser groupings or by pairing observations
that have similar, though not necessarily identical, values. See Cochran
(1965), Rubin (1973), or Rosenbaum (1995, chapter 3) for discussions of this
approach. With continuously distributed covariates, matching estimators are
biased because matches are imperfect. Abadie and Imbens (2008) have recently
shown that a regression-based bias correction can eliminate the (asymptotic)
bias from imperfect matches.

For the purposes of this discussion, the treatment intensity,
$s_i$, is assumed to be a continuously distributed random variable,
not necessarily non-negative. Suppose that the CEF of
interest can be written $h(t) \equiv E[Y_i|s_i = t]$ with derivative $h'(t)$.
Then

\[
\frac{E[Y_i(s_i - E[s_i])]}{E[s_i(s_i - E[s_i])]} = \frac{\int h'(t)\mu_t\,dt}{\int \mu_t\,dt}, \tag{3.3.8}
\]

where

\[
\mu_t \equiv \{E[s_i|s_i \geq t] - E[s_i|s_i < t]\}\{P(s_i \geq t)[1 - P(s_i \geq t)]\}, \tag{3.3.9}
\]

and the integrals in (3.3.8) run over the possible values of
$s_i$. This formula (derived by Yitzhaki, 1996) weights each
possible value of $s_i$ in proportion to the difference in the
conditional mean of $s_i$ above and below that value. More
weight is also given to points close to the median of $s_i$, since
$P(s_i \geq t) \cdot [1 - P(s_i \geq t)]$ is maximized there.
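Formula (3.3.8) can be checked numerically for any distribution of $s_i$ and any smooth CEF. The sketch below is an illustration under assumptions of our own: the function name `yitzhaki_check` is hypothetical, $s_i$ is taken to be uniform on the unit interval, $h(t) = t^2$, and the integrals are approximated by sums on an evenly spaced grid. In this case both sides should come out very close to 1.

```python
import numpy as np

def yitzhaki_check(grid, density, h, h_prime):
    """Return both sides of (3.3.8) for s discretised on an evenly spaced grid."""
    mass = density / density.sum()               # discretised distribution of s
    mean_s = (grid * mass).sum()

    # Left side: population regression slope Cov(Y, s) / V(s) with Y = h(s).
    slope = (mass * h * (grid - mean_s)).sum() / (mass * (grid - mean_s) ** 2).sum()

    # Right side: the weights in (3.3.9). Multiplying the formula out gives
    #   mu_t = E[s 1(s >= t)] P(s < t) - E[s 1(s < t)] P(s >= t),
    # which avoids dividing by probabilities that vanish at the endpoints.
    upper_prob = mass[::-1].cumsum()[::-1]            # P(s >= t)
    upper_sum = (grid * mass)[::-1].cumsum()[::-1]    # E[s 1(s >= t)]
    lower_prob = 1.0 - upper_prob                     # P(s < t)
    lower_sum = mean_s - upper_sum                    # E[s 1(s < t)]
    mu = upper_sum * lower_prob - lower_sum * upper_prob

    # The grid spacing cancels in the ratio of the two integrals.
    weighted_derivative = (h_prime * mu).sum() / mu.sum()
    return slope, weighted_derivative

if __name__ == "__main__":
    t = (np.arange(2000) + 0.5) / 2000               # midpoints of [0, 1]
    lhs, rhs = yitzhaki_check(t, np.ones_like(t), t ** 2, 2 * t)
    print(f"regression slope: {lhs:.4f}   weighted average derivative: {rhs:.4f}")
```

The small remaining gap between the two numbers reflects the grid approximation and shrinks as the grid is refined.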

With covariates, $X_i$, the weights in (3.3.8) become $X$-specific.
A covariate-averaged version of the same formula
applies to the multivariate regression coefficient of $Y_i$ on $s_i$,
after partialing out $X_i$. In particular,

\[
\frac{E[Y_i(s_i - E[s_i|X_i])]}{E[s_i(s_i - E[s_i|X_i])]} = \frac{E[\int h'_X(t)\mu_{tX}\,dt]}{E[\int \mu_{tX}\,dt]}, \tag{3.3.10}
\]

where $h'_X(t) \equiv \frac{\partial E[Y_i|X_i, s_i = t]}{\partial t}$ and

\begin{align*}
\mu_{tX} \equiv{} & \{E[s_i|X_i, s_i \geq t] - E[s_i|X_i, s_i < t]\} \\
& \times \{P(s_i \geq t|X_i)[1 - P(s_i \geq t|X_i)]\}.
\end{align*}

Equation (3.3.10) reflects two types of averaging: an integral
that averages along the length of a nonlinear CEF at
fixed covariate values, and an expectation that averages across
covariate cells. An important point in this context is that population
regression coefficients contain no information about
the effect of $s_i$ on the CEF for values of $X_i$ where $P(s_i \geq t|X_i)$
equals zero or one. This includes values of $X_i$ where $s_i$ is
fixed. It’s also worth noting that if $s_i$ is a dummy variable, we
can extract equation (3.3.7) from the more general formula,
(3.3.10).

Angrist and Krueger (1999) constructed the average weight- 
ing function for a schooling regression with state of birth 
and year of birth covariates. Although equations (3.3.8) and 
(3.3.10) may seem arcane or at least nonobvious, in this example 
the average weights, $E[\mu_{tX}]$, turn out to be a reasonably 
smooth symmetric function of $t$, centered at the mode of $s_i$. 

The implications of (3.3.8) or (3.3.10) can be explored further 
given a model for the distribution of regressors. Suppose, 
for example, that $s_i$ is normally distributed. Let $z_i = \frac{s_i - E(s_i)}{\sigma_s}$, 
where $\sigma_s$ is the standard deviation of $s_i$, so that $z_i$ is standard 
normal. Then 

$$E[s_i|s_i \ge t] = E(s_i) + \sigma_s E\left[z_i \,\middle|\, z_i \ge \frac{t - E(s_i)}{\sigma_s}\right] = E(s_i) + \sigma_s E[z_i|z_i \ge t^*],$$

where $t^*$ denotes the standardized value $\frac{t - E(s_i)}{\sigma_s}$. 
From truncated normal formulas (see, e.g., Johnson and Kotz, 
1970), we know that 

$$E[z_i|z_i \ge t^*] = \frac{\phi(t^*)}{1 - \Phi(t^*)} \quad \text{and} \quad E[z_i|z_i < t^*] = \frac{-\phi(t^*)}{\Phi(t^*)},$$

where $\phi(\cdot)$ and $\Phi(\cdot)$ are the standard normal density and distribution 
functions. Substituting in the formula for $\mu_t$, (3.3.9), 
we have 

$$\mu_t = \sigma_s\left\{\frac{\phi(t^*)}{1 - \Phi(t^*)} - \frac{-\phi(t^*)}{\Phi(t^*)}\right\}[1 - \Phi(t^*)]\Phi(t^*) = \sigma_s\phi(t^*).$$

Because $\sigma_s\phi(t^*)$ is proportional to the normal density of $s_i$ 
evaluated at $t$, the weighted average in (3.3.8) is an average 
over the distribution of $s_i$. We have therefore shown that 

$$\frac{Cov(Y_i, s_i)}{V(s_i)} = E[h'(s_i)].$$

In other words, when $s_i$ is normal, the regression of $Y_i$ on $s_i$ is 
the unconditional average derivative, $E[h'(s_i)]$. Of course, this 
result is a special case of a special case.²⁶ Still, it seems reasonable 
to imagine that normality might not matter very much. 
And in our empirical experience, the average derivatives (also 
called “marginal effects”) constructed from parametric nonlinear 
models (e.g., probit or Tobit) are usually indistinguishable 
from the corresponding regression coefficients, regardless of 
the distribution of regressors. We expand on this point in 
section 3.4.2. 
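
A quick simulation can make the normal-regressor result concrete. The sketch below is an illustration only, not part of the derivation: it assumes a hypothetical CEF, h(t) = sin(t), a normal regressor with an arbitrary mean and standard deviation, and additive noise. It then compares the bivariate regression slope with the sample average of h'(s_i) = cos(s_i). The function name `ols_slope` and all parameter values are our own choices.

```python
import math
import random
import statistics

def ols_slope(y, s):
    """Bivariate regression slope, Cov(y, s) / V(s)."""
    y_bar, s_bar = statistics.fmean(y), statistics.fmean(s)
    cov = sum((yi - y_bar) * (si - s_bar) for yi, si in zip(y, s))
    var = sum((si - s_bar) ** 2 for si in s)
    return cov / var

rng = random.Random(12345)
n = 200_000

# Hypothetical design: normal regressor, nonlinear CEF h(t) = sin(t).
s = [rng.gauss(0.5, 1.2) for _ in range(n)]
y = [math.sin(si) + rng.gauss(0.0, 1.0) for si in s]

slope = ols_slope(y, s)
# Sample average of the derivative h'(t) = cos(t).
avg_derivative = statistics.fmean([math.cos(si) for si in s])

print(f"regression slope:   {slope:.3f}")
print(f"average derivative: {avg_derivative:.3f}")
```

With a sample this large the two numbers should agree up to sampling error. Replacing the normal draw with a skewed distribution is a simple way to explore how much normality matters in a given design.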


3.3.2 Control for Covariates Using 
the Propensity Score 


The most important result in regression theory is the OVB 
formula, which tells us that coefficients on included variables 
are unaffected by the omission of variables when the vari- 
ables omitted are uncorrelated with the variables included. 
The propensity score theorem, due to Rosenbaum and Rubin 
(1983), extends this idea to estimation strategies that rely on 
matching instead of regression, where the causal variable of 
interest is a treatment dummy.²⁷ 

The propensity score theorem says that if potential outcomes 
are independent of treatment status conditional on a 
multivariate covariate vector $X_i$, then potential outcomes are 
independent of treatment status conditional on a scalar function 
of covariates, the propensity score, defined as 
$p(X_i) \equiv E[D_i|X_i] = P[D_i = 1|X_i]$. Formally, we have the following 
theorem: 


Theorem 3.3.1 The Propensity Score Theorem. 
Suppose the CIA holds such that $\{Y_{0i}, Y_{1i}\} \perp\!\!\!\perp D_i \mid X_i$. Then 
$\{Y_{0i}, Y_{1i}\} \perp\!\!\!\perp D_i \mid p(X_i)$. 


26 Other specialized results in this spirit appear in Yitzhaki (1996) and Ruud 
(1986), who considers distribution-free estimation of limited-dependent- 
variable models. 

27 Propensity score methods can be adapted to multivalued treatments, 
though this has yet to catch on. See Imbens (2000) for an effort in this 
direction. 

Proof. It’s enough to show that $P[D_i = 1|Y_{ji}, p(X_i)]$ does not 
depend on $Y_{ji}$ for $j = 0, 1$: 

$$\begin{aligned}
P[D_i = 1|Y_{ji}, p(X_i)] &= E[D_i|Y_{ji}, p(X_i)] \\
&= E\{E[D_i|Y_{ji}, p(X_i), X_i]|Y_{ji}, p(X_i)\} \\
&= E\{E[D_i|Y_{ji}, X_i]|Y_{ji}, p(X_i)\} \\
&= E\{E[D_i|X_i]|Y_{ji}, p(X_i)\}, \text{ by the CIA.}
\end{aligned}$$

But $E\{E[D_i|X_i]|Y_{ji}, p(X_i)\} = E\{p(X_i)|Y_{ji}, p(X_i)\}$, which is clearly 
just $p(X_i)$. 


Like the OVB formula for regression, the propensity score 
theorem says that you need only control for covariates that 
affect the probability of treatment. But it also says something 
more: the only covariate you really need to control for is the 
probability of treatment itself. In practice, the propensity score 
theorem is usually used for estimation in two steps: first, $p(X_i)$ 
is estimated using some kind of parametric model, say, logit or 
probit. Then estimates of the effect of treatment are computed 
either by matching on the estimated score from this first step or 
using a weighting scheme described below (see Imbens, 2004, 
for an overview). 

Direct propensity score matching works in the same way as 
covariate matching except that we match on the score instead 
of the covariates directly. By the propensity score theorem and 
the CIA, 


$$E[Y_{1i} - Y_{0i}|D_i = 1] = E\{E[Y_i|p(X_i), D_i = 1] - E[Y_i|p(X_i), D_i = 0] \mid D_i = 1\}.$$


Estimates of the effect of treatment on the treated can therefore 
be obtained by stratifying on an estimate of $p(X_i)$ and substituting 
conditional sample averages for expectations or by 
matching each treated observation to controls with similar values 
of the propensity score (both of these approaches were used 
by Dehejia and Wahba, 1999). Alternatively, a model-based 
or nonparametric estimate of $E[Y_i|p(X_i), D_i]$ can be substituted 
for these conditional mean functions and the outer expectation 
replaced with a sum (as in Heckman, Ichimura, and Todd, 
1998). 

The somewhat niftier weighting approach to propensity 
score estimation skips the cumbersome matching step by 
exploiting the fact that the CIA implies $E\left[\frac{Y_iD_i}{p(X_i)}\right] = E[Y_{1i}]$ and 
$E\left[\frac{Y_i(1 - D_i)}{1 - p(X_i)}\right] = E[Y_{0i}]$.²⁸ Therefore, given a scheme for estimating 
$p(X_i)$, we can construct estimates of the average treatment 
effect from the sample analog of 

$$E[Y_{1i} - Y_{0i}] = E\left[\frac{Y_iD_i}{p(X_i)} - \frac{Y_i(1 - D_i)}{1 - p(X_i)}\right] = E\left[\frac{(D_i - p(X_i))Y_i}{p(X_i)(1 - p(X_i))}\right]. \tag{3.3.11}$$

This last expression is an estimand of the form suggested by 
Newey (1990) and Robins, Mark, and Newey (1992). We can 
similarly calculate the effect of treatment on the treated from 
the sample analog of 

$$E[Y_{1i} - Y_{0i}|D_i = 1] = E\left[\frac{(D_i - p(X_i))Y_i}{(1 - p(X_i))P(D_i = 1)}\right]. \tag{3.3.12}$$


The idea that you can correct for nonrandom sampling via 
weighting by the reciprocal of the probability of selection dates 
back to Horvitz and Thompson (1952). Of course, to make 
this approach feasible, and for the resulting estimates to be 
consistent, we need a consistent estimator of $p(X_i)$. 
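
The weighting estimators are short enough to write out directly. The following is a minimal sketch under stated assumptions, not a reference implementation: it simulates a hypothetical design with a single binary covariate and a known propensity score, then computes the sample analogs of (3.3.11) and (3.3.12). The function names `ipw_ate` and `ipw_att`, the score values, and the outcome equations are all invented for illustration.

```python
import random

def ipw_ate(y, d, p):
    """Sample analog of (3.3.11): average of Y*D/p - Y*(1-D)/(1-p)."""
    terms = [yi * di / pi - yi * (1 - di) / (1 - pi)
             for yi, di, pi in zip(y, d, p)]
    return sum(terms) / len(terms)

def ipw_att(y, d, p):
    """Sample analog of (3.3.12): average of (D-p)Y/(1-p), over P(D=1)."""
    n = len(y)
    numerator = sum((di - pi) * yi / (1 - pi)
                    for yi, di, pi in zip(y, d, p)) / n
    return numerator / (sum(d) / n)

rng = random.Random(2009)
score = {0: 0.2, 1: 0.6}  # hypothetical p(X) for a binary covariate X
y, d, p = [], [], []
for _ in range(100_000):
    x = 1 if rng.random() < 0.5 else 0
    di = 1 if rng.random() < score[x] else 0
    y0 = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
    y1 = y0 + 1.0 + x  # causal effect is 1 when x = 0 and 2 when x = 1
    y.append(y1 if di else y0)
    d.append(di)
    p.append(score[x])

n_treated = sum(d)
naive = (sum(yi for yi, di in zip(y, d) if di) / n_treated
         - sum(yi for yi, di in zip(y, d) if not di) / (len(d) - n_treated))
ate_hat = ipw_ate(y, d, p)  # population value in this design: 1.5
att_hat = ipw_att(y, d, p)  # population value in this design: 1.75

print(f"naive contrast: {naive:.2f}")
print(f"weighted ATE:   {ate_hat:.2f}")
print(f"weighted ATT:   {att_hat:.2f}")
```

In this design the naive treatment-control contrast is contaminated by selection on the covariate, while the two weighted averages should land near the population values noted in the comments. In applications the score is unknown and must be replaced by a consistent estimate.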

The Horvitz-Thompson version of the propensity score 
approach is appealing, since the estimator is essentially auto- 
mated, with no cumbersome matching required. The Horvitz- 
Thompson approach also highlights the close link between 
propensity score matching and regression, much as discussed 
for covariate matching in section 3.3.1. Consider again the 
regression estimand, $\delta_R$, for the population regression of $Y_i$ 
on $D_i$, controlling for a saturated model for covariates. This 
estimand can be written 

$$\delta_R = \frac{E[(D_i - p(X_i))Y_i]}{E[p(X_i)(1 - p(X_i))]}. \tag{3.3.13}$$


28 To see this, iterate expectations over $X_i$: $E\left[\frac{Y_iD_i}{p(X_i)}\right] = E\left\{E\left[\frac{Y_iD_i}{p(X_i)} \,\middle|\, X_i\right]\right\}$ and 
$E\left[\frac{Y_iD_i}{p(X_i)} \,\middle|\, X_i\right] = \frac{E[Y_i|D_i = 1, X_i]\,p(X_i)}{p(X_i)} = E[Y_{1i}|X_i]$. 

The two Horvitz-Thompson matching estimands, (3.3.11) and 
(3.3.12), and the regression estimand are all in the class of 
weighted average estimands considered by Hirano, Imbens, 
and Ridder (2003): 


$$E\left\{g(X_i)\left[\frac{Y_iD_i}{p(X_i)} - \frac{Y_i(1 - D_i)}{1 - p(X_i)}\right]\right\}, \tag{3.3.14}$$

where $g(X_i)$ is a known weighting function. (To go from estimand 
to estimator, replace $p(X_i)$ with a consistent estimator 
and replace expectations with sums.) For the average treatment 
effect, set $g(X_i) = 1$; for the effect on the treated, set 
$g(X_i) = \frac{p(X_i)}{P(D_i = 1)}$; and for regression, set 

$$g(X_i) = \frac{p(X_i)(1 - p(X_i))}{E[p(X_i)(1 - p(X_i))]}.$$


This similarity highlights once again the fact that regression 
and matching—including propensity score matching—are not 
really different animals, at least not until we specify a model 
for the propensity score. 
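
The three choices of weighting function can be checked numerically. The sketch below assumes a discrete covariate and estimates the score by the share treated in each covariate cell; the data-generating process and the names `cell_scores`, `weighted_estimand`, and `saturated_regression` are hypothetical. The last function computes the coefficient on the treatment dummy in a regression with a full set of cell dummies by partialing out the dummies, so the comparison with the regression weights is an algebraic identity in the sample.

```python
import random
from collections import defaultdict

def cell_scores(x, d):
    """Estimated score for discrete X: share treated in each covariate cell."""
    count, treated = defaultdict(int), defaultdict(int)
    for xi, di in zip(x, d):
        count[xi] += 1
        treated[xi] += di
    return [treated[xi] / count[xi] for xi in x]

def weighted_estimand(y, d, p, g):
    """Sample analog of (3.3.14) for weights g evaluated at each X_i."""
    total = sum(gi * (yi * di / pi - yi * (1 - di) / (1 - pi))
                for yi, di, pi, gi in zip(y, d, p, g))
    return total / len(y)

def saturated_regression(y, d, p):
    """Coefficient on D after partialing out cell dummies (residual D - p)."""
    num = sum((di - pi) * yi for yi, di, pi in zip(y, d, p))
    den = sum((di - pi) ** 2 for di, pi in zip(d, p))
    return num / den

rng = random.Random(7)
# Hypothetical data: three covariate cells, effect of treatment is 1 + x.
x = [rng.randrange(3) for _ in range(3000)]
d = [1 if rng.random() < (0.2, 0.5, 0.8)[xi] else 0 for xi in x]
y = [xi + di * (1.0 + xi) + rng.gauss(0.0, 1.0) for xi, di in zip(x, d)]

p = cell_scores(x, d)
n = len(y)
g_ate = [1.0] * n
g_att = [pi / (sum(d) / n) for pi in p]
mean_var = sum(pi * (1 - pi) for pi in p) / n
g_reg = [pi * (1 - pi) / mean_var for pi in p]

ate = weighted_estimand(y, d, p, g_ate)
att = weighted_estimand(y, d, p, g_att)
reg = weighted_estimand(y, d, p, g_reg)

print(f"ATE weights:        {ate:.3f}")
print(f"ATT weights:        {att:.3f}")
print(f"regression weights: {reg:.3f}")
print(f"saturated OLS:      {saturated_regression(y, d, p):.3f}")
```

The last two printed numbers coincide up to floating-point error. The effect on the treated exceeds the average treatment effect here only because this invented design gives larger effects to cells with higher treatment probability.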

A big question here is how best to model and estimate 
$p(X_i)$, or how much smoothing or stratification to use when 
estimating $E[Y_i|p(X_i), D_i]$, especially if the covariates are continuous. 
The regression analog of this question is how to 
parameterize the control variables (e.g., polynomials or main 
effects and interaction terms if the covariates are coded as discrete). 
The answer to this is inherently application-specific. A 
growing empirical literature suggests that a logit model for 
the propensity score with a few polynomial terms in continuous 
covariates works well in practice, though this cannot 
be a theorem, and inevitably, some experimentation will be 
required (see, e.g., Dehejia and Wahba, 1999).²⁹ 

29 Andrea Ichino and Sascha Becker have posted Stata programs that implement 
various matching estimators; see Becker and Ichino (2002). 

A developing theoretical literature has produced some 
thought-provoking theorems on the efficient use of the propensity 
score. First, from the point of view of asymptotic efficiency, 
there is usually a cost to matching on the propensity 
score instead of full covariate matching. We can get lower 
asymptotic standard errors by matching on any covariate that 
explains outcomes, whether or not it turns up in the propen- 
sity score. This we know from Hahn’s (1998) investigation of 
the maximal precision of estimates of treatment effects under 
the CIA, with and without knowledge of the propensity score. 
For example, in Angrist (1998), there is an efficiency gain from 
matching on year of birth, even if the probability of serving 
in the military is unrelated to birth year, because earnings are 
related to birth year. A regression analog for this point is the 
result that even in a scenario with no OVB, the long regres- 
sion generates more precise estimates of the coefficients on the 
variables included in a short regression whenever the omit- 
ted variables have some predictive power for outcomes (see 
section 3.1.3). 

Hahn’s (1998) results raise the question of why we should 
ever bother with estimators that use the propensity score. A 
philosophical argument is that the propensity score rightly 
focuses researcher attention on models for treatment assign- 
ment, something about which we may have reasonably good 
information, instead of the typically more complex and mys- 
terious process determining outcomes. This view seems espe- 
cially compelling when treatment assignment is the product 
of human institutions or government regulations, while the 
process determining outcomes is more anonymous (e.g., a mar- 
ket). For example, in a time series evaluation of the causal 
effects of monetary policy, Angrist and Kuersteiner (2004) 
argue that we know more about how the Federal Reserve sets 
interest rates than about the process determining GDP. In the 
same spirit, it may also be easier to validate a model for treat- 
ment assignment than to validate a model for outcomes (see 
Rosenbaum and Rubin, 1985, for a version of this argument). 

A more precise though purely statistical argument for using 
the propensity score is laid out in Angrist and Hahn (2004). 
This paper shows that even though there is no asymptotic 
efficiency gain from the use of estimators based on the propen- 
sity score, there will often be a gain in precision in finite 
samples. Since all real data sets are finite, this result is empirically 
relevant. Intuitively, if the covariates omitted from the 
propensity score explain little of the variation in outcomes (in 
a purely statistical sense), it may be better to ignore them than 
to bear the statistical burden imposed by the need to estimate 
their effects. This is easy to see in studies using data sets such as 
the NLSY, where there are hundreds of covariates that might 
predict outcomes. In practice, we focus on a small subset of 
all possible covariates. This subset is usually chosen with an 
eye to what predicts treatment. 

Finally, Hirano, Imbens, and Ridder (2003) provide an 
alternative asymptotic resolution of the “propensity score 
paradox” generated by Hahn’s (1998) theorems. They show 
that even though estimates of treatment effects based on a 
known propensity score are inefficient, for models with con- 
tinuous covariates, a Horvitz-Thompson-type weighting esti- 
mator is efficient when the weighting scheme uses a nonpara- 
metric estimate of the score. The facts that the propensity score 
is estimated and that it is estimated nonparametrically are both 
key for the Hirano, Imbens, and Ridder conclusions. 

Do the Hirano, Imbens, and Ridder (2003) results resolve 
the propensity score paradox? For the moment, we prefer the 
finite-sample resolution given by Angrist and Hahn (2004). 
The latter result highlights the fact that it is the researchers’ 
willingness to impose restrictions on the score that gives 
propensity score-based inference its conceptual and statistical 
power. In Angrist (1998), for example, an application with 
high-dimensional though discrete covariates, the unrestricted 
nonparametric estimator of the score is just the empirical 
probability of treatment in each covariate cell. With this nonparametric 
estimator plugged in for $p(X_i)$, it is straightforward 
to show that the sample analogs of (3.3.11) and (3.3.12) are 
algebraically equivalent to the corresponding full-covariate 
matching estimators. Hence, it’s no surprise that score-based 
estimation comes out efficient, since full-covariate matching is 
the asymptotically efficient benchmark. An essential element 
of propensity score methods is the use of prior knowledge for 
dimension reduction. The statistical payoff is an improvement 
in finite-sample behavior. If you’re not prepared to smooth, 
restrict, or otherwise reduce the dimensionality of the matching 
problem in a manner that has real empirical consequences, 
then you might as well go for full covariate matching or 
saturated regression control. 
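
The algebraic equivalence between score weighting and full covariate matching is easy to verify by hand on a small data set. The example below uses seven invented observations in two covariate cells; the data and variable names are hypothetical. The score is the empirical probability of treatment in each cell, and the matching estimators average within-cell treatment-control contrasts using cell shares (for the average treatment effect) or treated shares (for the effect on the treated).

```python
from collections import defaultdict

# Hypothetical toy data: (covariate cell, treatment dummy, outcome).
data = [("a", 1, 5.0), ("a", 1, 7.0), ("a", 0, 4.0),
        ("b", 1, 3.0), ("b", 0, 1.0), ("b", 0, 2.0), ("b", 0, 3.0)]

cells = defaultdict(lambda: {1: [], 0: []})
for x, d, y in data:
    cells[x][d].append(y)

n = len(data)
n_treated = sum(d for _, d, _ in data)

# Unrestricted nonparametric score: share treated in each cell.
score = {x: len(g[1]) / (len(g[1]) + len(g[0])) for x, g in cells.items()}

# Weighting estimators, the sample analogs of (3.3.11) and (3.3.12).
ipw_ate = sum(y * d / score[x] - y * (1 - d) / (1 - score[x])
              for x, d, y in data) / n
ipw_att = sum((d - score[x]) * y / (1 - score[x])
              for x, d, y in data) / n_treated

def contrast(g):
    """Treated mean minus control mean within one covariate cell."""
    return sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])

# Full covariate matching: within-cell contrasts averaged with cell weights.
match_ate = sum(contrast(g) * (len(g[1]) + len(g[0])) / n
                for g in cells.values())
match_att = sum(contrast(g) * len(g[1]) / n_treated
                for g in cells.values())

print(ipw_ate, match_ate)
print(ipw_att, match_att)
```

Each pair of printed numbers agrees up to floating-point error, which is the sense in which an unrestricted score adds nothing to full covariate matching.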


3.3.3 Propensity Score Methods versus Regression 


Propensity score methods shift attention from the estimation 
of $E[Y_i|X_i, D_i]$ to the estimation of the propensity score, 
$p(X_i) \equiv E[D_i|X_i]$. This is attractive in applications where the 
latter is easier to model or motivate. For example, Ashenfelter 
(1978) showed that participants in government-funded train- 
ing programs often have suffered a marked preprogram dip in 
earnings, a pattern found in many later studies. If this dip is the 
only thing that makes trainees special, then we can estimate 
the causal effect of training on earnings by controlling for past 
earnings dynamics. In practice, however, it’s hard to match on 
earnings dynamics since earnings histories are both continu- 
ous and multidimensional. Dehejia and Wahba (1999) argue 
in this context that the causal effects of training programs are 
better estimated by conditioning on the propensity score than 
by conditioning on the earnings histories themselves. 

The propensity score estimates reported by Dehejia and 
Wahba are remarkably close to the estimates from the ran- 
domized trial that constitute their benchmark. Nevertheless, 
we believe regression should be the starting point for most 
empirical projects. This is not a theorem; undoubtedly, there 
are circumstances in which propensity score matching pro- 
vides more reliable estimates of average causal effects. The 
first reason we don’t find ourselves on the propensity score 
bandwagon is practical: there are many details to be filled 
in when implementing propensity score matching, such as 
how to model the score and how to do inference; these 
details are not yet standardized. Different researchers might 
therefore reach very different conclusions, even when using 
the same data and covariates. Moreover, as we’ve seen 
with the Horvitz-Thompson estimands, there isn’t very much 
theoretical daylight between regression and propensity score 
weighting. If the regression model for covariates is fairly flexi- 
ble, say, close to saturated, regression can be seen as a type of 
propensity score weighting, so the difference is mostly in the 
implementation. In practice you may be far from saturation, 
but with the right covariates this shouldn’t matter. 

The face-off between regression and propensity score match- 
ing is illustrated here using the same National Supported Work 
(NSW) sample featured in Dehejia and Wahba (1999).³⁰ The 
NSW is a mid-1970s program that provided work experience 
to recipients with weak labor force attachment. Somewhat 
unusually for its time, the NSW was evaluated in a randomized 
trial. Lalonde’s (1986) pathbreaking analysis compared the 
results from the NSW randomized study to econometric results 
using nonexperimental control groups drawn from the PSID 
and the CPS. He came away pessimistic because plausible non- 
experimental methods generated a wide range of results, many 
of which were far from the experimental estimates. More- 
over, Lalonde argued, an objective investigator, not knowing 
the results of the randomized trial, would be unlikely to pick 
the best econometric specifications and observational control 
groups. 

In a striking second take on the Lalonde (1986) findings, 
Dehejia and Wahba (1999) found they could come close to 
the NSW experimental results by matching the NSW treat- 
ment group to observational control groups selected using the 
propensity score. They demonstrated this using various com- 
parison groups. Following Dehejia and Wahba (1999), we 
look again at two of the CPS comparison groups, first, a largely 
unselected sample (CPS-1), and then a narrower comparison 
group selected from the recently unemployed (CPS-3). 

Table 3.3.2 (columns 1-4 of which are a replication of 
table 1 in Dehejia and Wahba, 1999) reports descriptive statistics 
for the NSW treatment group, the randomly selected NSW 
control group, and our two observational control groups. 
The NSW treatment group and the randomly selected NSW 
control group are younger, less educated, more likely to be 
nonwhite, and have much lower earnings than the general population 
represented by the CPS-1 sample. The CPS-3 sample 
matches the NSW treatment group more closely but still shows 
some differences, particularly in terms of race and preprogram 
earnings. 


30 A more extended propensity-score face-off appears in the exchange 
between Smith and Todd (2005) and Dehejia (2005). 


TABLE 3.3.2 
Covariate means in the NSW and observational control samples 

                                            Full Comparison      P-Score Screened
                              NSW               Samples         Comparison Samples
                        Treated   Control     CPS-1     CPS-3     CPS-1     CPS-3
Variable                    (1)       (2)       (3)       (4)       (5)       (6)

Age                       25.82     25.05     33.23     28.03     25.63     25.97
Years of schooling        10.35     10.09     12.03     10.24     10.49     10.42
Black                       .84       .83       .07       .20       .96       .52
Hispanic                    .06       .11       .07       .14       .03       .20
Dropout                     .71       .83       .30       .60       .60       .63
Married                     .19       .15       .71       .51       .26       .29
1974 earnings             2,096     2,107    14,017     5,619     2,821     2,969
1975 earnings             1,532     1,267    13,651     2,466     1,950     1,859
Number of obs.              185       260    15,992       429       352       157

Notes: Adapted from Dehejia and Wahba (1999), table 1. The samples in the first 
four columns are as described in Dehejia and Wahba (1999). The samples in the last 
two columns are limited to comparison group observations with a propensity score 
between .1 and .9. Propensity score estimates use all the covariates listed in the table. 

Table 3.3.3 reports estimates of the NSW treatment effect. 
The dependent variable is annual earnings in 1978, a year or 
two after treatment. Rows of the table show results with alter- 
native sets of controls: none; all the demographic variables in 
table 3.3.2; lagged (1975) earnings; demographics plus lagged 
earnings; demographics and two lags of earnings. All estimates 
are from regressions of 1978 earnings on a treatment dummy 
plus controls (the raw treatment-control difference appears in 
the first row). 

Estimates using the experimental control group, reported in 
column 1, are on the order of $1,600-1,800. Not surprisingly, 
these estimates vary little across specifications. In contrast, the 
raw earnings gap between NSW participants and the CPS-1 
sample, reported in column 2, is roughly −$8,500, suggesting 
this comparison is heavily contaminated by selection bias. 
The addition of demographic controls and lagged earnings 
narrows the gap considerably; the estimated treatment effect 
reaches (positive) $800 in the last row. The results are even 
better in column 3, which uses the narrower CPS-3 comparison 
group. The characteristics of this group are much closer 
to those of NSW participants; consistent with this, the raw 
earnings difference is only −$635. The fully controlled estimate, 
reported in the last row, is close to $1,400, not far from 
the experimental treatment effect. 


TABLE 3.3.3 
Regression estimates of NSW training effects 
using alternative controls 

                                    Full Comparison        P-Score Screened
                                        Samples           Comparison Samples
                              NSW      CPS-1      CPS-3      CPS-1      CPS-3
Specification                 (1)        (2)        (3)        (4)        (5)

Raw difference              1,794     −8,498       −635
                            (633)      (712)      (657)

Demographic                 1,670     −3,437        771     −3,361        890
controls                    (639)      (710)      (837)      (811)      (884)
                                                         [139/497]  [154/154]

1975 earnings               1,750        −78        −91    No obs.        166
                            (632)      (537)      (641)                 (644)
                                                             [0/0]  [183/427]

Demographics,               1,636        623      1,010      1,201      1,050
1975 earnings               (638)      (558)      (822)      (722)      (861)
                                                         [149/357]  [157/162]

Demographics,               1,676        794      1,369      1,362        649
1974 and 1975               (639)      (548)      (809)      (708)      (853)
earnings                                                 [151/352]  [147/157]

Notes: The table reports regression estimates of training effects using the 
Dehejia-Wahba (1999) data with alternative sets of controls. The demographic 
controls are age, years of schooling, and dummies for black, Hispanic, 
high school dropout, and married. Standard errors are reported in parentheses. 
Observation counts are reported in brackets [treated/control]. There are 
no observations with an estimated propensity score in the interval [.1, .9] 
using only 1975 earnings as a covariate with CPS-1 data. 


A drawback of the process taking us from CPS-1 to CPS-3 
is the ad hoc nature of the rules used to construct the smaller 
and more carefully selected CPS-3 comparison group. The 
CPS-3 selection criteria can be motivated by the NSW pro- 
gram rules, which favor individuals with low earnings and 
weak labor-force attachment, but in practice, there are many 
ways to implement this. We’d therefore like a more systematic 
approach to prescreening. In a recent paper, Crump, Hotz, 
Imbens, and Mitnik (2009) suggest that the propensity score 
be used for systematic sample selection as a precursor to regres- 
sion estimation. This contrasts with our earlier discussion of 
the propensity score as the basis for an estimator. 

We implemented the Crump et al. (2009) suggestion by first 
estimating the propensity score on a pooled NSW-treatment 
and observational-comparison sample, and then picking only 
those observations with $0.1 < p(X_i) < 0.9$. In other words, the 
estimation sample is limited to observations with a predicted 
probability of treatment equal to at least 10 percent but no 
more than 90 percent. This ensures that regressions are esti- 
mated in a sample including only covariate cells where there 
are at least a few treated and control observations. Estimation 
using screened samples therefore requires no extrapolation 
to cells without “common support”—in other words, to 
cells where there is no overlap in the covariate distribution 
between treatment and controls. Descriptive statistics for sam- 
ples screened on the score (estimated using the full set of 
covariates listed in the table) appear in the last two columns 
of table 3.3.2. The covariate means in the screened CPS-1 and 
CPS-3 samples are much closer to the NSW means in column 
1 than are the covariate means from unscreened samples. 
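
The screening step itself is mechanical once a score has been estimated. The sketch below is a stylized illustration on simulated data, not a replication of the NSW analysis: it assumes a single covariate (hypothetical prior earnings, in thousands of dollars), fits a logit score on the pooled sample with a hand-rolled Newton-Raphson routine, and keeps observations with an estimated score strictly between 0.1 and 0.9. The helper names `sigmoid`, `fit_logit`, and `mean` are our own, and a real application would use a full covariate set and a packaged logit routine.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logit(x, d, iterations=25):
    """Newton-Raphson fit of P(D=1|x) = sigmoid(a + b*x), one covariate."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, di in zip(x, d):
            p = sigmoid(a + b * xi)
            w = p * (1.0 - p)
            g0 += di - p            # score with respect to a
            g1 += (di - p) * xi     # score with respect to b
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1978)
# Hypothetical pooled sample: a small treated group with low prior earnings
# and a large, coarsely selected comparison group.
treated = [rng.gauss(2.0, 1.5) for _ in range(200)]
comparison = [rng.gauss(10.0, 5.0) for _ in range(2000)]
x = treated + comparison
d = [1] * len(treated) + [0] * len(comparison)

a, b = fit_logit(x, d)
score = [sigmoid(a + b * xi) for xi in x]
keep = [0.1 < pi < 0.9 for pi in score]  # common-support screen

kept_treated = [xi for xi, di, k in zip(x, d, keep) if k and di == 1]
kept_comparison = [xi for xi, di, k in zip(x, d, keep) if k and di == 0]

full_gap = mean(comparison) - mean(treated)
screened_gap = mean(kept_comparison) - mean(kept_treated)
print(f"observations kept: {sum(keep)} of {len(x)}")
print(f"covariate gap, full sample:     {full_gap:.2f}")
print(f"covariate gap, screened sample: {screened_gap:.2f}")
```

The screened comparison group should look much more like the treated group on the covariate, which is the pattern described for the last two columns of table 3.3.2. Regression estimates would then be computed in the screened sample in the usual way.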

We explored the common support screener further using 
alternative sets of covariates, but with the same covariates used 
for both screening and the estimation of treatment effects at 
each iteration. The resulting estimates are displayed in the final 
two columns of table 3.3.3. Controlling for demographic vari- 
ables or lagged earnings alone, these results differ little from 
those in columns 2 and 3. With both demographic variables 
and a single lag of earnings as controls, however, the screened 
CPS-1 estimates are quite a bit closer to the experimental 
estimates than are the unscreened results. Screened CPS-1 
estimates with two lags of earnings are also close to the exper- 
imental benchmark. On the other hand, the common support 
screener improves the CPS-3 results only slightly with a single 
lag of earnings and seems to be a step backward with two. 

This investigation boosts our (already strong) faith in regres- 
sion. Regression control for the right covariates does a reason- 
ably good job of eliminating selection bias in the CPS-1 sample 
despite a huge baseline gap. Restricting the sample using our 
knowledge of program admissions criteria yields even better 
regression estimates with CPS-3, about as good as Dehejia and 
Wahba’s (1999) propensity score matching results with two 
lags of earnings. Systematic prescreening to enforce common 
support seems like a useful adjunct to regression estimation 
with CPS-1, a large and coarsely selected initial sample. The 
estimates in screened CPS-1 are as good as unscreened CPS-3. 
We note, however, that the standard errors for estimates using 
propensity score-screened samples have not been adjusted to 
reflect the sampling variance in our estimates of the score. 
An advantage of prescreening using prior information, as in 
the step from CPS-1 to CPS-3, is that no such adjustment is 
necessary. 


3.4 Regression Details 


3.4.1 Weighting Regression 


Few things are as confusing to applied researchers as the role 
of sample weights. Even now, 20 years post-Ph.D., we read the 
section of the Stata manual on weighting with some dismay. 
Weights can be used in a number of ways, and how they are 
used may well matter for your results. Regrettably, however, 
the case for or against weighting is often less than clear-cut, as 
are the specifics of how the weights should be programmed. A 
detailed discussion of weighting pros and cons is beyond the 
scope of this book. See Pfefferman (1993) and Deaton (1997) 
for two perspectives. In this brief subsection, we provide a few 
guidelines and a rationale for our approach to weighting. 

A simple rule of thumb for weighting regression is to use 
weights when they make it more likely that the regression 
you are estimating is close to the population target you are 
trying to estimate. If, for example, the target (or estimand) 
is the population regression function, and the sample to be 
used for estimation is nonrandom with sampling weights, $w_i$, 
equal to the inverse probability of sampling observation $i$, 
then it makes sense to use weighted least squares, weighting 
by $w_i$ (for this you can use Stata pweights or a SAS 
weight statement). Weighting by the inverse sampling probability 
generates estimates that are consistent for the population 
regression function even if the sample you have to work with 
is not a simple random sample. 
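
A deliberately artificial example shows the mechanics. In the sketch below, a hypothetical population has a nonlinear CEF and two strata; the sample takes exactly one in four units from the common stratum and every unit from the rare stratum, so the sampling probabilities are known. The function `wls_line` and the data are invented for illustration, and the exact match between the weighted sample regression and the population regression is a feature of this stylized design; with genuinely random sampling the match holds only up to sampling error.

```python
def wls_line(x, y, w):
    """Weighted least squares intercept and slope for a single regressor."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical population with a nonlinear CEF, y = x squared.
common = [(x, x ** 2) for x in (0.0, 1.0, 2.0)]  # each appears four times
rare = [(x, x ** 2) for x in (3.0, 4.0)]         # each appears once
population = common * 4 + rare

# Nonrandom sample: one in four common units, every rare unit.
sample = [(x, y, 0.25) for x, y in common] + [(x, y, 1.0) for x, y in rare]

pop = wls_line([x for x, _ in population], [y for _, y in population],
               [1.0] * len(population))
unweighted = wls_line([x for x, _, _ in sample], [y for _, y, _ in sample],
                      [1.0] * len(sample))
weighted = wls_line([x for x, _, _ in sample], [y for _, y, _ in sample],
                    [1.0 / prob for _, _, prob in sample])

print("population regression:", pop)
print("unweighted sample:    ", unweighted)
print("weighted sample:      ", weighted)
```

The unweighted sample regression answers a different question, because it describes a population in which rare and common units are equally represented.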

A related weighting scenario involves grouped data. Suppose 
you would like to regress $Y_i$ on $X_i$ in a random sample, 
presumably because you want to learn about the population 
regression vector $\beta = E[X_iX_i']^{-1}E[X_iY_i]$. Instead of a random 
sample, however, you have data grouped at the level of $X_i$. 
That is, you have estimates of $E[Y_i|X_i = x]$ for each $x$, estimated 
using data from a random sample. Let this average be 
denoted $\bar{y}_x$, and suppose you also know $n_x$, where $n_x/N$ is 
the relative frequency of the value $x$ in the underlying random 
sample. As we saw in section 3.1.2, the regression of 
$\bar{y}_x$ on $x$, weighted by $n_x$, is the same as the random sample 
microdata regression. Therefore, if your goal is to get back to 
the microdata regression, it makes sense to weight by group 
size. We note, however, that macroeconomists, accustomed to 
working with published averages (like per capita income) and 
ignoring the underlying microdata, might disagree, or perhaps 
take the point in principle but remain disinclined to buck tra- 
dition in their discipline, which favors the unweighted analysis 
of aggregate variables. 
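
The grouped-data result can be confirmed on a handful of invented observations. The sketch below assumes a scalar regressor and uses a hypothetical helper, `wls_line`, to run the microdata regression, the regression of group means weighted by group size, and the unweighted regression of group means.

```python
from collections import defaultdict

def wls_line(x, y, w):
    """Weighted least squares intercept and slope for a single regressor."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical microdata: (x, y) pairs with unequal group sizes.
micro = [(1, 2.0), (1, 4.0), (1, 3.0), (2, 5.0), (2, 7.0), (3, 4.0)]

groups = defaultdict(list)
for xi, yi in micro:
    groups[xi].append(yi)
xs = sorted(groups)
means = [sum(groups[v]) / len(groups[v]) for v in xs]  # group averages
sizes = [len(groups[v]) for v in xs]                   # group sizes n_x

micro_fit = wls_line([xi for xi, _ in micro], [yi for _, yi in micro],
                     [1.0] * len(micro))
grouped_weighted = wls_line(xs, means, sizes)
grouped_unweighted = wls_line(xs, means, [1.0] * len(xs))

print("microdata:          ", micro_fit)
print("grouped, weighted:  ", grouped_weighted)
print("grouped, unweighted:", grouped_unweighted)
```

The first two fits are identical, while the unweighted regression of group means gives every cell the same influence regardless of how many observations stand behind it.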

On the other hand, if the sole rationale for weighting is heteroskedasticity, 
as in many textbook discussions of weighting, 
we are even less sympathetic to weighting than the macroeconomists. 
The argument for weighting under heteroskedasticity 
goes roughly like this: suppose you are interested in 
a linear CEF, $E[Y_i|X_i] = X_i'\beta$. The error term, defined as 
$e_i \equiv Y_i - X_i'\beta$, may be heteroskedastic. That is, the conditional 
variance function $E[e_i^2|X_i]$ need not be constant. In this case, 
while the population regression function is still equal to 
$E[X_iX_i']^{-1}E[X_iY_i]$, the sample analog is inefficient. A more 
precise estimator of the linear CEF is WLS, that is, the estimator 
that minimizes the sum of squared errors weighted by 
an estimate of $E[e_i^2|X_i]^{-1}$. 

As noted in section 3.1.3, an inherently heteroskedastic scenario 
is the LPM, where $Y_i$ is a dummy variable. Assuming the 
CEF is in fact linear, as it will be if the model is saturated, 
then $P[Y_i = 1|X_i] = X_i'\beta$ and therefore $E[e_i^2|X_i] = X_i'\beta(1 - X_i'\beta)$, 
which is obviously a function of $X_i$. This is an example of 
model-based heteroskedasticity where estimates of the conditional 
variance function are easily constructed from estimates 
of the underlying regression function. The efficient WLS estimator 
for the LPM, a special case of generalized least squares 
(GLS), is to weight by $[X_i'\beta(1 - X_i'\beta)]^{-1}$. Because the CEF has 
been assumed to be linear, these weights can be estimated in 
a first pass by OLS. 
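
The two-pass procedure just described can be sketched in a few lines. The example assumes a hypothetical linear CEF with a single four-valued regressor, so that the LPM is correctly specified without being saturated; the helper `wls_line`, the parameter values, and the clipping of fitted probabilities are our own assumptions rather than part of the argument in the text.

```python
import random

def wls_line(x, y, w):
    """Weighted least squares intercept and slope for a single regressor."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

rng = random.Random(31)
# Hypothetical linear CEF: P(Y = 1 | x) = 0.1 + 0.2 x for x in {0, 1, 2, 3}.
x = [rng.randrange(4) for _ in range(20_000)]
y = [1 if rng.random() < 0.1 + 0.2 * xi else 0 for xi in x]

# First pass: OLS (equal weights) gives fitted probabilities.
a_ols, b_ols = wls_line(x, y, [1.0] * len(x))
fitted = [a_ols + b_ols * xi for xi in x]

# Practical guard, our own assumption: keep fitted values inside (0, 1)
# so the estimated variances are positive.
fitted = [min(max(p, 0.01), 0.99) for p in fitted]

# Second pass: weight by the inverse of the estimated conditional variance.
weights = [1.0 / (p * (1.0 - p)) for p in fitted]
a_wls, b_wls = wls_line(x, y, weights)

print(f"OLS slope: {b_ols:.4f}")
print(f"WLS slope: {b_wls:.4f}")
```

In a design like this one the two slope estimates are typically very close, which is one way to see how small the payoff to weighting for heteroskedasticity can be.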

There are two reasons why we prefer not to weight in this case 
(though we would use heteroskedasticity-consistent standard 
errors). First, in practice, the estimates of $E[e_i^2|X_i]$ may not be 
very good. If the conditional variance model is a poor approximation 
or if the estimates of it are very noisy, WLS estimates 
may have worse finite-sample properties than unweighted 
estimates. The inferences you draw based on asymptotic theory 
may therefore be misleading, and the hoped-for efficiency 
gain may not materialize.³¹ Second, if the CEF is not linear, 
the WLS estimator is no more likely to estimate it than is the 
unweighted estimator. On the other hand, the unweighted esti- 
mator still estimates something easy to interpret: the MMSE 
linear approximation to the population CEF. 

31 Altonji and Segal (1996) discuss this point in a generalized method-of-moments 
context. 

WLS estimators also provide some sort of approximation, 
but the nature of this approximation depends on the weights. 
At a minimum, this makes it harder to compare your results 
to estimates reported by other researchers, and opens up additional 
avenues for specification searches when results depend 
on weighting. Finally, an old caution comes to mind: if it 
ain’t broke, don’t fix it. The interpretation of the population 
regression vector is unaffected by heteroskedasticity, so why 
worry about it? Any efficiency gain from weighting is likely to 
be modest, and incorrect or poorly estimated weights can do 
more harm than good. 


3.4.2 Limited Dependent Variables 
and Marginal Effects 


Many empirical studies involve dependent variables that take 
on only a limited number of values. An example is the Angrist 
and Evans (1998) investigation of the effect of childbearing on 
female labor supply, also discussed in the chapter on instru- 
mental variables. This study is concerned with the causal 
effects of childbearing on parents’ work and earnings. Because 
childbearing is likely to be correlated with potential earnings, 
Angrist and Evans report instrumental variables estimates 
based on sibling-sex composition and multiple births, as well 
as OLS estimates. Almost every outcome in this study is either 
binary (e.g., employment status) or non-negative (e.g., hours 
worked, weeks worked, and earnings). Should the fact that a 
dependent variable is limited affect empirical practice? Many 
econometrics textbooks argue that, while OLS is fine for con- 
tinuous dependent variables, when the outcome of interest is a 
limited dependent variable (LDV), linear regression models are 
inappropriate and nonlinear models such as probit and Tobit 
are preferred. In contrast, our view of regression as inheriting 
its legitimacy from the CEF makes LDVness less central. 

As always, a useful benchmark is a randomized experiment, 
where regression generates a simple treatment-control differ- 
ence. Consider, for example, regressions of various outcome 
variables on a randomly assigned regressor that indicates one 
of the treatment groups in the RAND Health Insurance Exper- 
iment (HIE; Manning et al. 1987). In this ambitious experi- 
ment, probably the most expensive in American social science, 
the RAND Corporation set up a small health insurance com- 
pany that charged no premium. Nearly 6,000 participants in 
the study were randomly assigned to health insurance plans 
with different features. 

One of the most important features of any insurance plan
is the portion of health care costs the insured individual is 
expected to pay. The HIE randomly assigned individuals to 
many different plans. One plan provided entirely free care, 
while the others included various combinations of copay- 
ments, expenditure caps, and deductibles, so that enrollees 
paid for some of their health care costs out-of-pocket. The 
main purpose of the experiment was to learn whether the use 
of medical care is sensitive to cost and, if so, whether this 
affects health. The HIE results showed that those offered free 
or low-cost medical care used more of it but were not, for 
the most part, any healthier as a result. These findings helped 
pave the way for cost-sensitive health insurance plans and 
managed care. 

Most of the outcomes in the HIE are LDVs. These include 
dummies indicating whether an experimental subject incurred 
any medical expenditures or was hospitalized in a given year, 
and non-negative outcomes such as the number of face-to-face 
doctor visits and gross annual medical expenses (whether paid 
by patient or insurer). The expenditure variable is zero for 
about 20 percent of the sample. Results for two of the HIE 
treatment groups are reproduced in table 3.4.1, derived from 
the estimates reported in table 2 of Manning et al. (1987). 
Table 3.4.1 shows average outcomes in the free care and indi- 
vidual deductible groups. The latter group faced a deductible 
of $150 per person or $450 per family per year for outpatient 
care, after which all costs were covered (there was no charge 
for inpatient care). The overall sample size in these two groups 
was a little over 3,000. 

TABLE 3.4.1
Average outcomes in two of the HIE treatment groups

            Face-to-Face Outpatient                Prob. Any    Prob. Any    Total
            Visits       Expenses     Admissions   Medical      Inpatient    Expenses
Plan                     (1984 $)     (%)          (%)          (%)          (1984 $)
Free        4.55         340          12.8         86.8         10.3         749
            (.17)        (10.9)       (.7)         (.8)         (.5)         (39)
Deductible  3.02         235          11.          72.3         9.6          608
            (.17)        (11.9)       (.8)         (1.5)        (.6)         (46)
Deductible  −1.53        −105         −1.3         −14.5        −0.7         −141
minus free  (.24)        (16.1)       (1.0)        (1.7)        (.7)         (60)

Notes: Adapted from Manning et al. (1987), table 2. All standard errors (shown in
parentheses) are corrected for intertemporal and intrafamily correlations. Amounts are
in June 1984 dollars. Visits are face-to-face contacts with health providers; visits solely
for radiology, anesthesiology, or pathology services are excluded. Visits and expenses
exclude dental care and outpatient psychotherapy.

To simplify the LDV discussion, suppose that the com-
parison between free care and deductible plans is the only
comparison of interest and that treatment was determined by
simple random assignment.³² Let \(D_i = 1\) denote assignment
to the deductible group. By virtue of random assignment, the
difference in means between those with \(D_i = 1\) and \(D_i = 0\)
gives the unconditional average treatment effect. As in our
earlier discussion of experiments (chapter 2):

\[
\begin{aligned}
E[Y_i|D_i = 1] - E[Y_i|D_i = 0] &= E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] \\
&= E[Y_{1i} - Y_{0i}]
\end{aligned} \tag{3.4.1}
\]

because \(D_i\) is independent of potential outcomes. Also, as
before, \(E[Y_i|D_i = 1] - E[Y_i|D_i = 0]\) is the slope coefficient in
a regression of \(Y_i\) on \(D_i\).

32 The HIE was considerably more complicated than described here. There
were 14 different treatments, including assignment to a prepaid HMO-like
service. The experimental design did not use simple random assignment but
rather a more complicated stratified assignment scheme meant to ensure
covariate balance across groups.
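
The claim that the regression slope on a dummy equals the difference in group means is easy to check numerically. The sketch below is a minimal illustration on simulated data: the assignment rule, the sample size, and the usage rates (loosely patterned on the "any medical" column of table 3.4.1) are assumptions made for the example, and `ols_slope` is a hypothetical helper name.

```python
import random

def ols_slope(d, y):
    """Slope from a bivariate OLS regression of y on d (with a constant)."""
    n = len(d)
    dbar, ybar = sum(d) / n, sum(y) / n
    cov = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y))
    var = sum((di - dbar) ** 2 for di in d)
    return cov / var

random.seed(1)
n = 5000
# Simple random assignment: D = 1 stands for the deductible group.
d = [1 if random.random() < 0.5 else 0 for _ in range(n)]
# Binary outcome with assumed usage rates of .72 (deductible) and .87 (free).
y = [1 if random.random() < (0.72 if di else 0.87) else 0 for di in d]

n1 = sum(d)
mean1 = sum(yi for di, yi in zip(d, y) if di == 1) / n1
mean0 = sum(yi for di, yi in zip(d, y) if di == 0) / (n - n1)

print(f"OLS slope:           {ols_slope(d, y):.4f}")
print(f"difference in means: {mean1 - mean0:.4f}")
```

The two numbers agree up to floating-point error whatever the distribution of the outcome, which is the sense in which a limited dependent variable changes nothing here.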

Equation (3.4.1) suggests that the estimation of causal 
effects in experiments presents no special challenges whether \(Y_i\)
is binary, non-negative, or continuously distributed. Although 
the interpretation of the right-hand side changes for different 
sorts of dependent variables, you do not need to do anything 
special to get the average causal effect. For example, one of the 
HIE outcomes is a dummy denoting any medical expenditure. 
Since the outcome here is a Bernoulli trial, we have 


\[
E[Y_{1i} - Y_{0i}] = E[Y_{1i}] - E[Y_{0i}] = P[Y_{1i} = 1] - P[Y_{0i} = 1]. \tag{3.4.2}
\]

This might affect the language we use to describe results, but
not the underlying calculation. In the HIE, for example,
comparisons across experimental groups, as on the left-hand side
of (3.4.1), show that 87 percent of those assigned to the
free-care group used at least some care in a given year, while only
72 percent of those assigned to the deductible plan used care.
The relatively modest $150 deductible therefore had a marked
effect on use of care. The difference between these two rates,
−.15, is an estimate of \(E[Y_{1i} - Y_{0i}]\), where \(Y_i\) is a dummy
indicating any medical expenditure. Because the outcome here is
a dummy variable, the average causal effect is also a causal
effect on usage rates or probabilities.

Recognizing that the medical usage outcome variable is a 
probability, suppose instead that you use probit to fit the CEF 
in this case. No harm in trying! The probit model is usually 
motivated by the assumption that participation is determined 
by a latent variable, \(Y_i^*\), that satisfies

\[
Y_i^* = \beta_0^* + \beta_1^* D_i - \nu_i, \tag{3.4.3}
\]

where \(\nu_i\) is distributed \(N(0, \sigma_\nu^2)\). Note that this latent
variable cannot be actual medical expenditure since expenditure is
non-negative and therefore non-normal, while normally distributed
random variables are continuously distributed on the real line
and can therefore be negative. Given the latent index model,

\[
Y_i = 1[Y_i^* > 0],
\]

so the CEF for \(Y_i\) can be written

\[
E[Y_i|D_i] = \Phi\left[\frac{\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right],
\]

where \(\Phi[\cdot]\) is the normal CDF. Therefore

\[
E[Y_i|D_i] = \Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]
+ \left\{\Phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right]
- \Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]\right\} D_i.
\]

This is a linear function of the regressor, \(D_i\), so the slope
coefficient in the linear regression of \(Y_i\) on \(D_i\) is just the
difference in probit fitted values,
\(\Phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right] - \Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right]\).
But the probit coefficients, \(\beta_0^*/\sigma_\nu\) and
\(\beta_1^*/\sigma_\nu\), do not give us the size of the effect of \(D_i\)
on participation until we feed them back into the normal CDF
(though they do have the right sign). Regression, in contrast,
gives us what we need with or without the probit distributional
assumptions.
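
A short numerical sketch may help fix ideas. With a single dummy regressor the probit model is saturated, so its fitted probabilities reproduce the two group means; the implied index coefficients can then be backed out by inverting the normal CDF. The two usage rates below come from table 3.4.1; the variable names are hypothetical, and the calculation is an illustration rather than a re-estimation on the HIE microdata.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of any medical expense, from table 3.4.1.
p_free, p_deductible = 0.868, 0.723

# Saturated probit: fitted probabilities equal the group means, so the
# normalized index coefficients follow from the inverse normal CDF.
b0 = std_normal.inv_cdf(p_free)              # plays the role of beta_0* / sigma_nu
b1 = std_normal.inv_cdf(p_deductible) - b0   # plays the role of beta_1* / sigma_nu

# Feed the coefficients back through the CDF to get the effect on usage.
probit_effect = std_normal.cdf(b0 + b1) - std_normal.cdf(b0)
ols_slope = p_deductible - p_free            # difference in means

print(f"probit index coefficient b1:      {b1:.3f}")
print(f"effect from probit fitted values: {probit_effect:.3f}")
print(f"OLS slope:                        {ols_slope:.3f}")
```

The index coefficient has the right sign but is not on the probability scale; only after passing through the CDF does it match the regression slope.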

One of the most important outcomes in the HIE is gross 
medical expenditure, in other words, health care costs. Did 
subjects who faced a deductible use less care, as measured by 
the cost? In the HIE, the average difference in expenditures 
between the deductible and free-care groups was −141 dollars,
about 19 percent of the expenditure level in the free-care
group. This calculation suggests that making patients pay a 
portion of costs reduces expenditures quite a bit, though the 
estimate is not very precise. 

Because expenditure outcomes are non-negative random 
variables, and sometimes equal to zero, their expectation can 
be written 


\[
E[Y_i|D_i] = E[Y_i|Y_i > 0, D_i]\, P[Y_i > 0|D_i].
\]

The difference in expenditure outcomes across treatment
groups is

\[
\begin{aligned}
E[Y_i|D_i = 1] - E[Y_i|D_i = 0]
&= E[Y_i|Y_i > 0, D_i = 1]\, P[Y_i > 0|D_i = 1] \\
&\quad - E[Y_i|Y_i > 0, D_i = 0]\, P[Y_i > 0|D_i = 0] \\
&= \underbrace{\{P[Y_i > 0|D_i = 1] - P[Y_i > 0|D_i = 0]\}}_{\text{Participation effect}}
   \, E[Y_i|Y_i > 0, D_i = 1] \\
&\quad + \underbrace{\{E[Y_i|Y_i > 0, D_i = 1] - E[Y_i|Y_i > 0, D_i = 0]\}}_{\text{COP effect}}
   \, P[Y_i > 0|D_i = 0].
\end{aligned} \tag{3.4.4}
\]

So, the overall difference in average expenditure can be
broken up into two parts: the difference in the probability that
expenditures are positive (often called a participation effect)
and the difference in means conditional on participation, a
conditional-on-positive (COP) effect. Again, however, this has
no special implications for the estimation of causal effects;
equation (3.4.1) remains true: the regression of \(Y_i\) on \(D_i\) gives
the unconditional average treatment effect for expenditures.
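
The decomposition in (3.4.4) is an accounting identity, which a small simulation can confirm. In the sketch below the outcome distribution (a point mass at zero plus an exponential tail), the participation rates, and the conditional means are all made-up values in the rough range of table 3.4.1; `draw_expenditure` and `summarize` are hypothetical helper names.

```python
import random

random.seed(2)

def draw_expenditure(p_positive, mean_if_positive):
    """Assumed outcome: zero with probability 1 - p_positive, else exponential."""
    if random.random() < p_positive:
        return random.expovariate(1.0 / mean_if_positive)
    return 0.0

n = 4000
y_free = [draw_expenditure(0.87, 860.0) for _ in range(n)]     # D = 0
y_deduct = [draw_expenditure(0.72, 840.0) for _ in range(n)]   # D = 1

def summarize(y):
    """Return (overall mean, share positive, mean among the positive)."""
    positive = [v for v in y if v > 0]
    return sum(y) / len(y), len(positive) / len(y), sum(positive) / len(positive)

mean0, p0, cop0 = summarize(y_free)
mean1, p1, cop1 = summarize(y_deduct)

participation_part = (p1 - p0) * cop1   # first term of (3.4.4)
cop_part = (cop1 - cop0) * p0           # second term of (3.4.4)

print(f"difference in means:  {mean1 - mean0:.2f}")
print(f"participation + COP:  {participation_part + cop_part:.2f}")
```

The two printed numbers coincide up to rounding, because each group mean is exactly the share positive times the mean among the positive.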


Good COP, Bad COP: Conditional-on-Positive Effects 


Because causal effects on a non-negative random variable such 
as expenditure have two parts, some applied researchers feel 
they should look at these parts separately. In fact, many use 
a two-part model, in which the first part is an evaluation of 
the effect on participation and the second part looks at COP 
effects (see, e.g., Duan et al., 1983 and 1984, for such models 
applied to the HIE). The first part of (3.4.4) raises no special 
issues, because, as noted above, the fact that \(Y_i\) is a dummy
means only that average treatment effects are also differences 
in probabilities. The problem with the two-part model is that 
the COP effects do not have a causal interpretation, even in 
a randomized trial. This complication can be understood as 
the same selection problem described in section 3.2.3, on bad 
control. 
To analyze the COP effect further, write

\[
\begin{aligned}
E[Y_i|Y_i > 0, D_i = 1] - E[Y_i|Y_i > 0, D_i = 0]
&= E[Y_{1i}|Y_{1i} > 0] - E[Y_{0i}|Y_{0i} > 0] \\
&= \underbrace{E[Y_{1i} - Y_{0i}|Y_{1i} > 0]}_{\text{Causal effect}}
 + \underbrace{\{E[Y_{0i}|Y_{1i} > 0] - E[Y_{0i}|Y_{0i} > 0]\}}_{\text{Selection bias}},
\end{aligned} \tag{3.4.5}
\]

where the second line uses the random assignment of \(D_i\). This
decomposition shows that the COP effect is composed of two
terms: a causal effect for the subpopulation that uses medical
care with a deductible and the difference in \(Y_{0i}\) between those
who use medical care when they have to pay something and
when it is free. This second term is a form of selection bias,
though it is more subtle than the selection bias in chapter 2.
Here selection bias arises because the experiment changes
the composition of the group with positive expenditures. The
\(Y_{0i} > 0\) population probably includes some low-cost users who
would opt out of care if they had to pay a deductible. In other
words, this group is larger and probably has lower costs on
average than the \(Y_{1i} > 0\) group. The selection bias term is
therefore positive, with the result that COP effects are closer
to zero than the presumably negative causal effect,
\(E[Y_{1i} - Y_{0i}|Y_{1i} > 0]\). This is a version of the bad control
problem from section 3.2.3: in a causal effects setting, \(Y_i > 0\)
is an outcome variable and therefore unkosher for conditioning
unless the treatment has no effect on the likelihood that \(Y_i\) is
positive.
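
To see the composition effect at work, it helps to simulate both potential outcomes for every person, something only a simulation allows. The sketch below assumes, purely for illustration, that the deductible lowers desired spending by a fixed amount and that spending cannot fall below zero; the distribution, the 150-dollar shift, and all variable names are hypothetical. Because both potential outcomes are known, the three terms of (3.4.5) can be computed directly rather than estimated from a randomized contrast.

```python
import random

random.seed(3)
n = 50000
deductible_cut = 150.0  # assumed fixed reduction in desired spending

# Made-up latent "desired spending"; observed spending is censored at zero.
desired = [random.gauss(600.0, 500.0) for _ in range(n)]
y0 = [max(0.0, v) for v in desired]                    # spending with free care
y1 = [max(0.0, v - deductible_cut) for v in desired]   # spending with a deductible

def mean(values):
    return sum(values) / len(values)

users1 = [i for i in range(n) if y1[i] > 0]  # would use care with a deductible
users0 = [i for i in range(n) if y0[i] > 0]  # would use care when it is free

cop_effect = mean([y1[i] for i in users1]) - mean([y0[i] for i in users0])
causal_effect = mean([y1[i] - y0[i] for i in users1])
selection_bias = mean([y0[i] for i in users1]) - mean([y0[i] for i in users0])

print(f"COP effect     {cop_effect:8.1f}")
print(f"causal effect  {causal_effect:8.1f}")
print(f"selection bias {selection_bias:8.1f}")
```

In this setup the group with positive spending under free care includes low spenders who drop out under the deductible, so the selection term is positive and the COP contrast understates the size of the causal effect.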

One resolution of the noncausality of COP effects relies on 
censored regression models like Tobit. These models postu- 
late a latent expenditure outcome for nonparticipants (e.g., 
Hay and Olsen, 1984). A traditional Tobit formulation for 
the expenditure problem stipulates that the observed \(Y_i\) is
generated by

\[
Y_i = 1[Y_i^* > 0]\, Y_i^*,
\]

where \(Y_i^*\) is a normally distributed latent expenditure variable
that can take on negative values. Because \(Y_i^*\) is not an LDV,
Tobit proponents feel comfortable linking this to \(D_i\) using a
traditional linear model, say, equation (3.4.3). In this case,
\(\beta_1^*\) is the causal effect of \(D_i\) on latent expenditure, \(Y_i^*\).
This equation is defined for everyone, whether \(Y_i\) is positive or
not. There is no COP-style selection problem if we are happy to
study effects on \(Y_i^*\).

But we are not happy with effects on \(Y_i^*\). The first problem is
that “latent health care expenditure” is a puzzling construct.
Health care expenditure really is zero for some people; this is
not a statistical artifact or due to some kind of censoring. So the
notion of latent and potentially negative \(Y_i^*\) is hard to grasp.
There are no data on \(Y_i^*\) and there never will be. A second
problem is that the link between the parameter \(\beta_1^*\) in the latent
model and causal effects on the observed outcome, \(Y_i\), turns on
distributional assumptions about the latent variable. To
establish this link we evaluate the expectation of \(Y_i\) given \(D_i\)
to find

\[
E[Y_i|D_i] = \Phi\left[\frac{\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]
[\beta_0^* + \beta_1^* D_i]
+ \sigma_\nu \varphi\left[\frac{\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]
\tag{3.4.6}
\]

(see, e.g., McDonald and Moffitt, 1980). This expression is
derived using the normality and homoskedasticity of \(\nu_i\) and
the assumption that \(Y_i\) can be represented as \(1[Y_i^* > 0]\, Y_i^*\).
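The Tobit CEF provides us with an expression for the
average treatment effect on observed expenditure. Specifically,

\[
\begin{aligned}
E[Y_i|D_i = 1] - E[Y_i|D_i = 0]
&= \left\{\Phi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right][\beta_0^* + \beta_1^*]
 + \sigma_\nu \varphi\left[\frac{\beta_0^* + \beta_1^*}{\sigma_\nu}\right]\right\} \\
&\quad - \left\{\Phi\left[\frac{\beta_0^*}{\sigma_\nu}\right][\beta_0^*]
 + \sigma_\nu \varphi\left[\frac{\beta_0^*}{\sigma_\nu}\right]\right\},
\end{aligned} \tag{3.4.7}
\]

a rather daunting formula. But since the only regressor is a
dummy variable, \(D_i\), none of this is necessary for the estimation
of \(E[Y_i|D_i = 1] - E[Y_i|D_i = 0]\). The slope coefficient from an
OLS regression of \(Y_i\) on \(D_i\) recovers the CEF difference on the
left-hand side of (3.4.7) whether or not you adopt a Tobit
model to explain the underlying structure.³³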
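
As a rough check on this point, the sketch below simulates data from a Tobit model and compares the formula in (3.4.7), evaluated at the true parameters, with the plain difference in means (which is what the OLS slope on a dummy delivers). The parameter values are hypothetical, and alternating assignment is used as a simple stand-in for randomization.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()

def tobit_cef(index, sigma):
    """E[Y | D] for Y = 1[Y* > 0] Y*, with Y* = index - nu and nu ~ N(0, sigma^2)."""
    z = index / sigma
    return std_normal.cdf(z) * index + sigma * std_normal.pdf(z)

# Hypothetical latent-index parameters, chosen only for illustration.
beta0, beta1, sigma = 500.0, -150.0, 600.0

random.seed(4)
n = 40000
d = [i % 2 for i in range(n)]  # alternating assignment, independent of the errors
y = [max(0.0, beta0 + beta1 * di - random.gauss(0.0, sigma)) for di in d]

mean1 = sum(yi for di, yi in zip(d, y) if di == 1) / (n // 2)
mean0 = sum(yi for di, yi in zip(d, y) if di == 0) / (n // 2)

formula_effect = tobit_cef(beta0 + beta1, sigma) - tobit_cef(beta0, sigma)
print(f"Tobit formula (3.4.7): {formula_effect:.1f}")
print(f"difference in means:   {mean1 - mean0:.1f}")
print(f"latent coefficient:    {beta1:.1f}")
```

The difference in means tracks the formula up to sampling error without using any of the Tobit structure, and both are smaller in magnitude than the latent coefficient.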

COP effects are sometimes motivated by a researcher’s sense 
that when the outcome distribution has a mass point—that is, 
when it piles up on a particular value, such as zero—or has a 
heavily skewed distribution, or both, then an analysis of effects 
on averages misses something. Analyses of effects on averages 
indeed miss some things, such as changes in the probability of 
specific values or a shift in quantiles away from the median. 
But why not look at these distribution effects directly? Dis- 
tribution outcomes include the likelihood that annual medical 
expenditures exceed zero, 100 dollars, 200 dollars, and so on. 
In other words, put \(1[Y_i > c]\) for different choices of \(c\) on the
left-hand side of the regression of interest. Econometrically,
these outcomes are all in the category of (3.4.2). The idea of
looking directly at distribution effects with linear probability
models is illustrated by Angrist (2001), in an analysis of
the effects of childbearing on hours worked. Alternatively, if
quantiles provide a focal point, we can use quantile regression
to model them. Chapter 7 discusses this idea in detail.

33 A generalization of Tobit is the sample selection model, where the latent
variable determining participation differs from the latent expenditure variable.
See, for example, Maddala (1983). The same conceptual problems related to
the interpretation of effects on latent variables arise in the sample selection
model as with Tobit.
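
The distribution-effects idea described in the text, putting an indicator for exceeding a threshold on the left-hand side, can be sketched as follows. The simulated expenditure variable, the thresholds, and the helper name `share_above` are assumptions for illustration only; with a dummy regressor, each regression slope is just a difference in shares.

```python
import random

random.seed(5)
n = 20000
d = [i % 2 for i in range(n)]  # stand-in for random assignment
# Made-up expenditure outcome with a mass point at zero.
y = [max(0.0, random.gauss(500.0 - 150.0 * di, 600.0)) for di in d]

def share_above(c, group):
    """Sample analogue of P[Y > c | D = group]."""
    selected = [yi for di, yi in zip(d, y) if di == group]
    return sum(1 for yi in selected if yi > c) / len(selected)

# Each entry equals the slope from regressing the dummy 1[Y > c] on D.
effects = {c: share_above(c, 1) - share_above(c, 0) for c in (0, 100, 200, 500, 1000)}
for c, effect in effects.items():
    print(f"effect on P[Y > {c}]: {effect:+.3f}")
```

Reading the effects across thresholds shows where in the distribution the treatment bites, with no latent-variable model required.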

Do Tobit-type latent variable models ever make sense? Yes, 
if the data you are working with are truly censored. True cen- 
soring means the latent variable has an empirical counterpart 
that is the outcome of primary interest. A leading example 
from labor economics is CPS earnings data, which topcodes 
(censors) very high values of earnings to protect respondent 
confidentiality. Typically, we’re interested in the causal effect 
of schooling on earnings as it appears on respondents’ tax 
returns, not their CPS-topcoded earnings. Chamberlain (1994) 
shows that in some years, CPS topcoding reduces the measured 
returns to schooling considerably, and proposes an adjustment 
for censoring based on a Tobit-style adaptation of quantile 
regression. The use of quantile regression to model censored 
data is also discussed in chapter 7.³⁴


Covariates Lead to Nonlinearity 


True censoring as with the CPS topcode is rare, a fact that
leaves limited scope for constructive applications of Tobit-type
models in applied work. At this point, however, we
have to hedge a bit. Part of the neatness in the discussion
of experiments comes from the fact that \(E[Y_i|D_i]\) is necessarily
a linear function of \(D_i\), so that regression and the CEF are
one and the same. In fact, this CEF is linear for any function
of \(Y_i\), including the distribution indicators, \(1[Y_i > c]\). In
practice, of course, the explanatory variable of interest isn’t
always a dummy, and there are usually additional covariates
in the CEF, in which case \(E[Y_i|X_i, D_i]\) for LDVs is almost
certainly nonlinear. Intuitively, as predicted means get close to
the dependent variable boundaries, the derivatives of the CEF
for LDVs get smaller (think, for example, of how the normal
CDF flattens at extreme values).

34 We should note that our favorite regression example, a regression of log
wages on schooling, may have a COP problem since the sample of log wages
naturally omits those with zero earnings. This leads to COP-style selection
bias if education affects the probability of working. In practice, therefore,
we focus on samples of prime-age males, whose participation rates are high
and reasonably stable across schooling groups (e.g., white men aged 40-49 in
figure 3.1.1).

The upshot is that in LDV models with covariates, regres- 
sion need not fit the CEF perfectly. It remains true, however, 
that the underlying CEF has a causal interpretation if the 
CIA holds. And if the CEF has a causal interpretation, it 
seems fair to say that regression has a causal interpretation 
as well, because it still provides the MMSE approximation to 
the CEF. Moreover, if the model for covariates is saturated, 
then regression also estimates a weighted average treatment 
effect similar to (3.3.1) and (3.3.3). Likewise, if the regressor 
of interest is multivalued or continuous, we get a weighted 
average derivative, as described by the formulas at the end of 
subsection 3.3.1. 

And yet, we may not have enough data for the saturated- 
covariate regression specification to be very attractive. Regres- 
sion will therefore miss some features of the CEF. For one 
thing, it may generate fitted values outside the LDV bound- 
aries. This fact bothers some researchers and has generated a 
lot of bad press for the linear probability model. One attrac- 
tive feature of nonlinear models like probit and Tobit is that 
they produce CEFs that respect LDV boundaries. In partic- 
ular, probit fitted values are always between zero and one, 
while Tobit fitted values are positive (this is not obvious from 
equation (3.4.6)). We might therefore prefer nonlinear models 
on simple curve-fitting grounds. 

Point conceded. It’s important to emphasize, however, that 
the output from nonlinear models must be converted into 
marginal effects to be useful. Marginal effects are the (aver- 
age) changes in CEF implied by a nonlinear model. Without 
marginal effects, it’s hard to talk about the impact on observed 
dependent variables. If we continue to assume the regressor of
interest is the dummy variable, \(D_i\), marginal effects can be
constructed either by differencing

\[
E\{E[Y_i|X_i, D_i = 1] - E[Y_i|X_i, D_i = 0]\},
\]

or, by differentiation,
\(E\left\{\frac{\partial E[Y_i|X_i, D_i]}{\partial D_i}\right\}\). Most people
use derivatives when dealing with continuous or multivalued
regressors.

How close do OLS regression estimates come to the
marginal effects induced by a nonlinear model like probit or
Tobit? We first derive the marginal effects, and then show
an empirical example. The probit CEF for a model with
covariates is

\[
E[Y_i|X_i, D_i] = \Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right].
\]

The average finite difference is therefore

\[
E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^*}{\sigma_\nu}\right]
- \Phi\left[\frac{X_i'\beta_0^*}{\sigma_\nu}\right]\right\}. \tag{3.4.8}
\]

In practice, this can be approximated by the average derivative,

\[
E\left\{\varphi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]\right\}
\cdot \left(\frac{\beta_1^*}{\sigma_\nu}\right)
\]

(Stata computes marginal effects both ways but defaults to
(3.4.8) for dummy regressors).
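
The two ways of computing a probit marginal effect can be compared numerically. The sketch below does not estimate a probit; it simply takes a hypothetical normalized index for each observation (drawn at random to stand in for the covariate part of the index) and a hypothetical normalized coefficient `b1`, then evaluates the average finite difference in (3.4.8) and the average derivative.

```python
import random
from statistics import NormalDist

std_normal = NormalDist()
random.seed(6)

n = 10000
# Assumed values of X'beta_0* / sigma_nu, one per observation.
index0 = [random.gauss(0.3, 0.8) for _ in range(n)]
# Assumed treatment dummy and normalized coefficient beta_1* / sigma_nu.
d = [1 if random.random() < 0.4 else 0 for _ in range(n)]
b1 = -0.45

# (3.4.8): average finite difference, switching D from 0 to 1 for everyone.
finite_diff = sum(std_normal.cdf(v + b1) - std_normal.cdf(v) for v in index0) / n

# Average derivative: normal density at each observation's actual index, times b1.
avg_deriv = b1 * sum(std_normal.pdf(v + b1 * di) for v, di in zip(index0, d)) / n

print(f"average finite difference: {finite_diff:.4f}")
print(f"average derivative:        {avg_deriv:.4f}")
```

For a coefficient of this size the two numbers are close, which is why the derivative is often used as a shortcut even for dummy regressors.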

Similarly, generalizing equation (3.4.6) to a model with
covariates, we have

\[
E[Y_i|X_i, D_i] = \Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]
[X_i'\beta_0^* + \beta_1^* D_i]
+ \sigma_\nu \varphi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]
\]

for a non-negative LDV. Tobit marginal effects are almost
always cast in terms of the average derivative, which can be
shown to be the surprisingly simple expression

\[
E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\nu}\right]\right\}
\cdot \beta_1^* \tag{3.4.9}
\]

(see, e.g., Wooldridge, 2006). One immediate implication
of (3.4.9) is that the Tobit coefficient, \(\beta_1^*\), is always too big
relative to the effect of \(D_i\) on \(Y_i\). Intuitively, this is because,
given the linear model for latent \(Y_i^*\), the latent outcome always
changes when \(D_i\) switches on or off. But real \(Y_i\) need not
change: for many people, it’s zero either way.
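
The simplicity of (3.4.9) comes from the fact that the derivative of the Tobit CEF with respect to the latent index is just the normal CDF evaluated at the normalized index. A quick numerical check, using hypothetical parameter values and a central finite difference, is sketched below.

```python
from statistics import NormalDist

std_normal = NormalDist()

def tobit_cef(index, sigma):
    """Phi(index / sigma) * index + sigma * phi(index / sigma), as in (3.4.6)."""
    z = index / sigma
    return std_normal.cdf(z) * index + sigma * std_normal.pdf(z)

sigma = 600.0    # hypothetical sigma_nu
beta1 = -150.0   # hypothetical Tobit coefficient beta_1*
step = 1e-3      # step size for the numerical derivative

checks = []
for index in (-300.0, 0.0, 400.0, 1200.0):
    # Central-difference derivative of the CEF with respect to the index.
    numeric = (tobit_cef(index + step, sigma) - tobit_cef(index - step, sigma)) / (2 * step)
    closed_form = std_normal.cdf(index / sigma)
    checks.append((numeric, closed_form))
    print(f"index {index:7.1f}: numeric {numeric:.4f}, Phi {closed_form:.4f}, "
          f"implied effect {closed_form * beta1:7.1f} vs coefficient {beta1:.1f}")
```

Since the CDF lies strictly between zero and one, the implied effect on the observed outcome is always smaller in magnitude than the latent coefficient.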

Table 3.4.2 compares OLS estimates and nonlinear marginal
effects for regressions of female employment and hours of 
work, both LDVs, on measures of fertility. These estimates 
were constructed using one of the 1980 census samples used 
by Angrist and Evans (1998). This sample includes married 
women aged 21-35 with at least two children. The childbear- 
ing variables consist of a dummy indicating women with more 
than two children or the total number of births. The covari- 
ates include linear terms in mother’s age, age at first birth, 
race dummies (black and Hispanic), and mother’s education 
(dummies for high school graduates, some college, and college 
graduates). The covariate model is not saturated; rather, there 
are additive terms and no interactions, though the underlying 
CEF in this example is surely nonlinear. 

Probit marginal effects for the impact of a dummy vari- 
able indicating more than two children are indistinguishable 
from OLS estimates of the same relation. This can be seen in 
columns 2, 3, and 4 of table 3.4.2, the first row of which com- 
pares the estimates from different methods for the full 1980 
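sample. The OLS estimate of the effect of a third child is −.162,
while the corresponding probit marginal effects are −.163
and −.162. These were estimated using (3.4.8) in the first
case and

\[
E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^*}{\sigma_\nu}\right]
- \Phi\left[\frac{X_i'\beta_0^*}{\sigma_\nu}\right] \Big|\, D_i = 1\right\}
\]

in the second (hence, a marginal effect on the treated).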

Tobit marginal effects for the relation between fertility and
hours worked are quite close to the corresponding OLS
estimates, though not indistinguishable. This can be seen in
columns 5 and 6. Compare, for example, the Tobit estimates
of −6.56 and −5.87 with the OLS estimate of −5.92 in column
2. Although one Tobit estimate is 10 percent larger in
absolute value, this seems unlikely to be of substantive importance.
The remaining columns of the table compare OLS estimates to
marginal effects for an ordinal childbearing variable instead
of a dummy. These calculations all use derivatives to
compute marginal effects (labeled MFX). Here, too, the OLS and
nonlinear marginal effects estimates are similar for both probit
and Tobit.

TABLE 3.4.2
Comparison of alternative estimates of the effect of childbearing on LDVs

                                       Right-Hand-Side Variable
                       More than Two Children                       Number of Children
                                Probit   Probit   Tobit    Tobit             Probit   Tobit    Tobit
                                Avg.     Avg.     Avg.     Avg.              MFX Avg. MFX Avg. MFX Avg.
                                Effect,  Effect   Effect,  Effect            Effect,  Effect,  Effect
                                Full     on       Full     on                Full     Full     on
Dependent     Mean     OLS      Sample   Treated  Sample   Treated  OLS      Sample   Sample   Treated
Variable      (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)      (9)      (10)

A. Full sample
Employment    .528     −.162    −.163    −.162    --       --       −.113    −.114    --       --
              (.499)   (.002)   (.002)   (.002)                     (.001)   (.001)
Hours worked  16.7     −5.92    --       --       −6.56    −5.87    −4.07    --       −4.66    −4.23
              (18.3)   (.074)                     (.081)   (.073)   (.047)            (.054)   (.049)

B. Nonwhite college attenders over age 30, first birth before age 20
Employment    .832     −.061    −.064    −.070    --       --       −.054    −.048    --       --
              (.374)   (.028)   (.028)   (.031)                     (.016)   (.013)
Hours worked  30.8     −4.69    --       --       −4.97    −4.90    −2.83    --       −3.20    −3.15
              (16.0)   (1.18)                     (1.33)   (1.31)   (.645)            (.670)   (.659)

Notes: The table reports OLS estimates, average treatment effects, and marginal effects (MFX) for the effect of childbearing on mothers’
labor supply. The sample in panel A includes 254,654 observations and is the same as the 1980 census sample of married women used
by Angrist and Evans (1998). Covariates include age, age at first birth, and dummies for boys at first and second birth. The sample in
panel B includes 746 nonwhite women with at least some college aged over 30 whose first birth was before age 20. Standard deviations
are reported in parentheses in column 1. Standard errors are shown in parentheses in other columns. The sample used to estimate average
effects on the treated in columns 4, 6, and 10 includes women with more than two children.

It is sometimes said that probit models can be expected to 
generate marginal effects close to OLS when the predicted 
probabilities are close to .5 because the underlying nonlinear 
CEF is roughly linear in the middle. With predictions close to 
zero or one, however, we might expect a larger gap. We there- 
fore replicated the comparison of OLS and marginal effects 
in a subsample with relatively high average employment rates, 
nonwhite women over age 30 who attended college and whose 
first birth was before age 20. Although the average employ- 
ment rate is 83 percent in this group, the OLS estimates and 
marginal effects are again similar. 

The upshot of this discussion is that while a nonlinear model 
may fit the CEF for LDVs more closely than a linear model, 
when it comes to marginal effects, this probably matters little. 
This optimistic conclusion is not a theorem, but, as in the 
empirical example here, it seems to be fairly robustly true. 

Why, then, should we bother with nonlinear models and 
marginal effects? One answer is that the marginal effects are 
easy enough to compute now that they are automated in pack- 
ages like Stata. But there are a number of decisions to make 
along the way (e.g., the weighting scheme, derivatives versus 
finite differences), while OLS is standardized. Nonlinear life 
also gets considerably more complicated when we work with 
instrumental variables and panel data. Finally, extra com- 
plexity comes into the inference step as well, since we need 
standard errors for marginal effects. The principle of Occam’s 
razor advises, “Entities should not be multiplied unnecessar- 
ily.” In this spirit, we quote our former teacher, Angus Deaton 
(1997), pondering the nonlinear regression function generated 
by Tobit-type models: 


Absent knowledge of F [the distribution of the errors], this 
regression function does not even identify the β’s [Tobit
coefficients] —see Powell (1989)—but more fundamentally, we 
should ask how it has come about that we have to deal with 
such an awkward, difficult, and non-robust object. 


3.4.3 Why Is Regression Called Regression, and
What Does Regression to the Mean Mean? 


The term regression originates with Francis Galton’s (1886) 
study of height. Galton, who is pictured visiting his tailor 
on page 26, worked with samples of roughly normally dis- 
tributed data on parents and children. He noted that the CEF 
of a child’s height conditional on his parents’ height is linear, 
with parameters given by the bivariate regression slope and 
intercept. Since height is stationary (its distribution does not 
change much over time), the bivariate regression slope is also 
the correlation coefficient, that is, between zero and one. 

The single regressor in Galton’s setup, \(x_i\), is average
parent height and the dependent variable, \(Y_i\), is the height of
adult children. The regression slope coefficient, as always, is
\(\beta_1 = \frac{Cov(Y_i, x_i)}{V(x_i)}\), and the intercept is
\(\alpha = E[Y_i] - \beta_1 E[x_i]\). But
because height is not changing across generations, the mean
and variance of \(Y_i\) and \(x_i\) are the same. Therefore,

\[
\begin{aligned}
\beta_1 &= \frac{Cov(Y_i, x_i)}{V(x_i)}
 = \frac{Cov(Y_i, x_i)}{\sqrt{V(x_i)}\sqrt{V(Y_i)}} = \rho_{xy}, \\
\alpha &= E[Y_i] - \beta_1 E[x_i] = \mu(1 - \beta_1) = \mu(1 - \rho_{xy}),
\end{aligned}
\]

where \(\rho_{xy}\) is the intergenerational correlation coefficient in
height and \(\mu = E[Y_i] = E[x_i]\) is the population average height.
From this we get the linear CEF

\[
E[Y_i|x_i] = \mu(1 - \rho_{xy}) + \rho_{xy} x_i,
\]


so the height of a child given his parents’ height is a weighted 
average of his parents’ height and the population average 
height. The child of tall parents will therefore not be as tall 
as they are, on average. Likewise, for the short. To be spe- 
cific, Pischke, who is six feet three inches tall, can expect his 
children to be tall, though not as tall as he is. Thankfully, 
however, Angrist, who is five feet six inches tall, can expect 
his children to be taller than he is. Galton called this property 
“regression toward mediocrity in hereditary stature.” Today 
we call it regression to the mean. 
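
Galton's result is easy to reproduce on simulated data. The sketch below draws parent and child heights with the same mean and variance and an assumed correlation; the numbers (68 inches, a standard deviation of 3, a correlation of 0.5) are hypothetical and chosen only to make the pattern visible, and `predicted_child_height` is an illustrative helper.

```python
import random

random.seed(7)
n = 20000
mu, sd, rho = 68.0, 3.0, 0.5  # assumed mean (inches), std. deviation, correlation

parents, children = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    parents.append(mu + sd * z1)
    # Same mean and variance as parents, with correlation rho between the two.
    children.append(mu + sd * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2))

xbar, ybar = sum(parents) / n, sum(children) / n
cov = sum((x - xbar) * (y - ybar) for x, y in zip(parents, children)) / n
var_x = sum((x - xbar) ** 2 for x in parents) / n
slope = cov / var_x
intercept = ybar - slope * xbar

def predicted_child_height(parent_height):
    """Fitted value from the bivariate regression of child on parent height."""
    return intercept + slope * parent_height

print(f"regression slope: {slope:.3f} (assumed correlation {rho})")
print(f"parents at 75 inches -> predicted child {predicted_child_height(75.0):.1f}")
print(f"parents at 66 inches -> predicted child {predicted_child_height(66.0):.1f}")
```

Tall parents get a prediction between their own height and the population mean, and short parents get one between the mean and their own height, which is regression to the mean.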

Galton, who was Charles Darwin’s cousin, is also remem-
bered for having founded the Eugenics Society, dedicated 
to breeding better people. Indeed, his interest in regression 
came largely from this quest. We conclude from this that the 
value of scientific ideas should not be judged by their author’s 
politics. 

Galton does not seem to have shown much interest in 
multiple regression, our chief concern in this chapter. The 
regressions in Galton’s work are mechanical features of distri- 
butions of stationary random variables; they work just as well 
for the regression of parents’ height on children’s height and
are certainly not causal. Galton would have said so himself, 
because he objected to the Lamarckian idea (later promoted 
in Stalin’s Russia) that acquired traits can be inherited. 

The idea that regression can be used for statistical control 
in pursuit of causality satisfyingly originates in an inquiry into 
the determinants of poverty rates by George Udny Yule (1899). 
Yule, a statistician and student of Karl Pearson (Pearson was 
Galton’s protégé), realized that Galton’s regression coefficient 
could be extended to multiple variables by solving the least 
squares normal equations that had been derived long before by 
Legendre and Gauss. Yule’s (1899) paper appears to be the first 
publication containing multivariate regression estimates. His 
model links changes in poverty rates in an area to changes in 
the local administration of the English Poor Laws, while con- 
trolling for population growth and the age distribution in the 
area. He was particularly interested in whether out-relief, the 
practice of providing income support for poor people without 
requiring them to move to the poorhouse, did not itself con- 
tribute to higher poverty rates. This is a well-defined causal 
question of a sort that still occupies us today.*> 

Finally, we note that the history of regression is beauti- 
fully detailed in the book by Steven Stigler (1986). Stigler is a 
famous statistician at the University of Chicago, but not quite 


35Yule’s first applied paper on the poor laws was published in 1895 in the 
Economic Journal, where Pischke is proud to serve as co-editor. The theory 
of multiple regression that goes along with this appears in Yule (1897). 
as famous as his father, the economist and Nobel laureate,
George Stigler. 


3.5 Appendix: Derivation of the Average Derivative 
Weighting Function 


Begin with the regression of $Y_i$ on $s_i$:

$$\frac{\mathrm{Cov}(Y_i, s_i)}{V(s_i)} = \frac{E[h(s_i)(s_i - E[s_i])]}{E[s_i(s_i - E[s_i])]}.$$

Let $\kappa_{-\infty} \equiv \lim_{t \to -\infty} h(t)$, which we assume exists. By the
fundamental theorem of calculus, we have:

$$h(s_i) = \kappa_{-\infty} + \int_{-\infty}^{s_i} h'(t)\,dt.$$

Substituting for $h(s_i)$, the numerator becomes

$$E[h(s_i)(s_i - E[s_i])] = \int_{-\infty}^{+\infty} \int_{-\infty}^{u} h'(t)(u - E[s_i])\,g(u)\,dt\,du,$$

where $g(u)$ is the density of $s_i$ at $u$. Reversing the order of
integration, we have

$$E[h(s_i)(s_i - E[s_i])] = \int_{-\infty}^{+\infty} h'(t) \int_{t}^{+\infty} (u - E[s_i])\,g(u)\,du\,dt.$$

The inner integral is equal to
$\mu_t \equiv \{E[s_i \mid s_i \geq t] - E[s_i \mid s_i < t]\}\{P(s_i \geq t)[1 - P(s_i \geq t)]\}$,
the weighting function in (3.3.9), which is clearly non-negative.
Setting $s_i = Y_i$, the denominator can similarly be shown to be
the integral of these weights. We therefore have a weighted
average derivative representation of the bivariate regression
coefficient, $\mathrm{Cov}(Y_i, s_i)/V(s_i)$. A similar formula for a regression
with covariates is derived in the appendix to
Angrist and Krueger (1999).
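
As a numerical sanity check on this representation, the sketch below evaluates both sides of the identity on a grid. The distribution (standard normal $s_i$) and the conditional expectation function ($h(t) = e^t$) are illustrative assumptions of ours, not taken from the text; the integrals are approximated by simple Riemann sums.

```python
import numpy as np

# Illustrative assumptions (not from the text): s ~ N(0, 1) and h(t) = exp(t).
t = np.linspace(-10.0, 10.0, 200_001)
dt = t[1] - t[0]
g = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)    # density of s at each grid point
h = np.exp(t)                                     # assumed CEF, h(s) = E[Y | s]
h_prime = np.exp(t)                               # its derivative

mean_s = np.sum(t * g) * dt

# Left-hand side: E[h(s)(s - E[s])] / E[s(s - E[s])]
direct = (np.sum(h * (t - mean_s) * g) * dt) / (np.sum(t * (t - mean_s) * g) * dt)

# Weights mu_t: the inner integral of (u - E[s]) g(u) over u >= t,
# computed as a reversed cumulative sum.
integrand = (t - mean_s) * g
mu = np.cumsum(integrand[::-1])[::-1] * dt

# Right-hand side: weighted average of h'(t) with weights mu_t
weighted = (np.sum(h_prime * mu) * dt) / (np.sum(mu) * dt)

print(f"direct: {direct:.4f}  weighted-average derivative: {weighted:.4f}")
```

Under these assumptions both numbers should agree up to discretization error, and the weights should be non-negative everywhere on the grid.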
Chapter 4


Instrumental Variables in Action: 
Sometimes You Get What You Need 




Anything that happens, happens. 

Anything that, in happening, causes something else to happen, 

causes something else to happen. 

Anything that, in happening, 

causes itself to happen again, happens again. 

It doesn’t necessarily do it in chronological order, though. 
Douglas Adams, Mostly Harmless 


Two things distinguish the discipline of econometrics from
the older sister field of statistics. One is a lack of shy- 
ness about causality. Causal inference has always been 
the name of the game in applied econometrics. Statistician 
Paul Holland (1986) cautions that there can be “no causa- 
tion without manipulation,” a maxim that would seem to rule 
out causal inference from nonexperimental data. Less thought- 
ful observers fall back on the truism that “correlation is not 
causality.” Like most people who work with data for a living, 
we believe that correlation can sometimes provide pretty good 
evidence of a causal relation, even when the variable of interest 
has not been manipulated by a researcher or experimenter.¹
The second thing that distinguishes us from most 
statisticians—and indeed from most other social scientists— 
is an arsenal of statistical tools that grew out of early 


1Recent years have seen an increased willingness by statisticians to discuss
statistical models for observational data in an explicitly causal framework; 
see, for example, Freedman’s (2005) review. 
econometric research on the problem of how to estimate the
parameters in a system of linear simultaneous equations. The 
most powerful weapon in this arsenal is the method of instru- 
mental variables (IV), the subject of this chapter. As it turns 
out, the IV method does more than allow us to consistently 
estimate the parameters in a system of simultaneous equations, 
though it allows us to do that as well. 

Studying agricultural markets in the 1920s, the father-and- 
son research team of Phillip and Sewall Wright were interested 
in a challenging problem of causal inference: how to estimate 
the slope of supply and demand curves when observed data 
on prices and quantities are determined by the intersection 
of these two curves. In other words, equilibrium prices and 
quantities—the only ones we get to observe—solve these two 
stochastic equations at the same time. On which curve, there- 
fore, does the observed scatterplot of prices and quantities lie? 
The fact that population regression coefficients do not capture 
the slope of any one equation in a set of simultaneous equa- 
tions had been understood by Phillip Wright for some time. 
The IV method, first laid out in Wright (1928), solves the 
statistical simultaneous equations problem by using variables 
that appear in one equation to shift this equation and trace 
out the other. The variables that do the shifting came to be 
known as instrumental variables (Reiersol, 1941). 

In a separate line of inquiry, IV methods were pioneered to
solve the problem of bias from measurement error in regression
models.² One of the most important results in the statistical
theory of linear models is that a regression coefficient is biased
toward zero when the regressor of interest is measured with
random errors (to see why, imagine the regressor contains only
random error; then it will be uncorrelated with the dependent
variable, and hence the coefficient from a regression of $Y_i$ on
this variable will be zero). Instrumental variables methods can
be used to eliminate this sort of bias.
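
The attenuation result, and the way an instrument undoes it, can be illustrated with a small simulation. This is a sketch under assumptions of our own: the true slope, the error variances, and the choice of instrument (a second noisy measure of the same regressor with an independent error) are hypothetical and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho = 0.5                                   # true slope (assumed for illustration)

s_true = rng.normal(size=n)                 # correctly measured regressor
y = 1.0 + rho * s_true + rng.normal(size=n)
s_obs = s_true + rng.normal(size=n)         # regressor observed with random error
z = s_true + rng.normal(size=n)             # second noisy measure, independent error

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

# OLS on the mismeasured regressor: shrunk toward zero (probability limit rho / 2 here)
ols = cov(y, s_obs) / cov(s_obs, s_obs)

# IV using the second measure as an instrument: probability limit rho
iv = cov(y, z) / cov(s_obs, z)

print(f"OLS slope: {ols:.3f}  IV slope: {iv:.3f}  true slope: {rho}")
```

With equal signal and noise variances, as assumed here, the OLS slope settles near half the true value while the IV ratio stays close to it.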

Simultaneous equations models (SEMs) have been enor- 
mously important in the history of econometric thought. At 


2Key historical references here are Wald (1940) and Durbin (1954), both 
discussed later in this chapter. 
the same time, few of today’s most influential applied papers
rely on an orthodox SEM framework, though the technical 
language used to discuss IV methods still comes from this 
framework. Today, we are more likely to find IV methods 
used to address measurement error problems than to estimate 
the parameters of an SEM. Undoubtedly, however, the most 
important contemporary use of IV methods is to solve the 
problem of omitted variables bias (OVB). IV methods solve the 
problem of missing or unknown control variables, much as a 
randomized trial obviates extensive controls in a regression.³


4.1 IV and Causality 


We like to tell the IV story in two iterations, first in a 
restricted model with constant effects, then in a framework 
with unrestricted heterogeneous potential outcomes, in which 
case causal effects must also be heterogeneous. The introduc- 
tion of heterogeneous effects enriches the interpretation of IV 
estimands without changing the mechanics of the core statis- 
tical methods we are most likely to use in practice (typically, 
two-stage least squares, or 2SLS). An initial focus on con- 
stant effects allows us to explain the mechanics of IV with a 
minimum of fuss. 

To motivate the constant effects setup as a framework for 
the causal link between schooling and wages, suppose, as 
before, that potential outcomes can be written 


$$Y_{si} \equiv f_i(s),$$


and that

$$f_i(s) = \alpha + \rho s + \eta_i, \tag{4.1.1}$$


as in the discussion of regression and causality in section 3.2. 
Also, as in the earlier discussion, we imagine that there is a 


3See Angrist and Krueger (2001) for a brief exposition of the history and 
uses of IV, Stock and Trebbi (2003) for a detailed account of the birth of IV, 
and Morgan (1990) for an extended history of econometric ideas, including 
the simultaneous equations model. 
vector of control variables, $A_i$, called “ability,” that gives a
selection-on-observables story:


$$\eta_i = A_i'\gamma + v_i,$$


where $\gamma$ is again a vector of population regression coefficients,
so that $v_i$ and $A_i$ are uncorrelated by construction. For now,
the variables $A_i$ are assumed to be the only reason why $\eta_i$ and
$s_i$ are correlated, so that


$$E[s_i v_i] = 0.$$


In other words, if $A_i$ were observed, we would be happy to
include it in the regression of wages on schooling, thereby
producing a long regression that can be written


$$Y_i = \alpha + \rho s_i + A_i'\gamma + v_i. \tag{4.1.2}$$


Equation (4.1.2) is a version of the linear causal
model (3.2.9). The error term in this equation is the random
part of potential outcomes, $v_i$, left over after controlling for
$A_i$. This error term is uncorrelated with schooling by assumption.
If this assumption turns out to be correct, the population
regression of $Y_i$ on $s_i$ and $A_i$ produces the coefficients in (4.1.2).

The problem we initially want to tackle is how to estimate
the long regression coefficient, $\rho$, when $A_i$ is unobserved.
Instrumental variables methods can be used to accomplish this
when the researcher has access to a variable (the instrument,
which we’ll call $z_i$) that is correlated with the causal variable
of interest, $s_i$, but uncorrelated with any other determinants of
the dependent variable. Here, the phrase “uncorrelated with
any other determinants of the dependent variable” is like saying
$\mathrm{Cov}(\eta_i, z_i) = 0$, or, equivalently, $z_i$ is uncorrelated with
both $A_i$ and $v_i$. This statement is called an exclusion restriction,
since $z_i$ can be said to be excluded from the causal model
of interest.

Given the exclusion restriction, it follows from (4.1.2) that


$$\rho = \frac{\mathrm{Cov}(Y_i, z_i)}{\mathrm{Cov}(s_i, z_i)} = \frac{\mathrm{Cov}(Y_i, z_i)/V(z_i)}{\mathrm{Cov}(s_i, z_i)/V(z_i)}. \tag{4.1.3}$$


The second equality in (4.1.3) is useful because it’s usually
easier to think in terms of regression coefficients than in terms
of covariances. The coefficient of interest, $\rho$, is the ratio of the
population regression of $Y_i$ on $z_i$ (called the reduced form) to
the population regression of $s_i$ on $z_i$ (called the first stage). The
IV estimator is the sample analog of expression (4.1.3). Note
that the IV estimand is predicated on the notion that the first
stage is not zero, but this is something you can check in the
data. As a rule, if the first stage is only marginally significantly
different from zero, the resulting IV estimates are unlikely to
be informative, a point we return to later.
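
A minimal simulation sketch of (4.1.3) follows. The data-generating process is entirely hypothetical (a binary instrument that shifts schooling, an unobserved "ability" term that raises both schooling and wages, and a true return of .10); the variable names are ours. It computes the reduced form, the first stage, and their ratio, alongside the OLS slope that omits ability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
rho = 0.10                                        # assumed causal effect of schooling

ability = rng.normal(size=n)                      # unobserved A_i
z = rng.binomial(1, 0.5, size=n).astype(float)    # instrument, independent of ability
s = 12.0 + 0.5 * z + 1.0 * ability + rng.normal(size=n)
y = 1.0 + rho * s + 0.3 * ability + rng.normal(scale=0.5, size=n)

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.cov(a, b)[0, 1]

ols = cov(y, s) / cov(s, s)                # short regression, contaminated by OVB
reduced_form = cov(y, z) / cov(z, z)       # regression of Y_i on z_i
first_stage = cov(s, z) / cov(z, z)        # regression of s_i on z_i
iv = reduced_form / first_stage            # sample analog of (4.1.3)

print(f"OLS: {ols:.3f}  first stage: {first_stage:.3f}  "
      f"reduced form: {reduced_form:.3f}  IV: {iv:.3f}")
```

In this setup the OLS slope is pulled well above the assumed .10 by the omitted ability term, while the ratio of reduced form to first stage stays close to it. The first stage here is strong by construction; the warning in the text about marginally significant first stages still applies to real data.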

It’s worth recapping the assumptions needed for the ratio
of covariances in (4.1.3) to equal the causal effect, $\rho$. First,
the instrument must have a clear effect on $s_i$. This is the first
stage. Second, the only reason for the relationship between $Y_i$
and $z_i$ is the first stage. For the moment, we’re calling this
second assumption the exclusion restriction, though as we’ll
see in the discussion of models with heterogeneous effects, this
assumption really has two parts: the first is the statement that
the instrument is as good as randomly assigned (i.e., independent
of potential outcomes, conditional on covariates, like the
CIA in chapter 3), and the second is that the instrument has no
effect on outcomes other than through the first-stage channel.

So, where can you find an instrumental variable? Good 
instruments come from a combination of institutional knowl- 
edge and ideas about the processes determining the variable 
of interest. For example, the economic model of education 
suggests that schooling decisions are based on the costs and 
benefits of alternative choices. Thus, one possible source of 
instruments for schooling is differences in costs due to loan 
policies or other subsidies that vary independently of ability 
or earnings potential. A second source of variation in school- 
ing is institutional constraints. A set of institutional constraints 
relevant for schooling is compulsory schooling laws. Angrist 
and Krueger (1991) exploit the variation induced by com- 
pulsory schooling in a paper that typifies the use of “natural 
experiments” to try to eliminate OVB. 

The starting point for the Angrist and Krueger (1991) 
quarter-of-birth strategy is the observation that most states 
require students to enter school in the calendar year in which 
they turn 6. School start age is therefore a function of date 
of birth. Specifically, those born late in the year are young
for their grade. In states with a December 31 birthday cutoff,
children born in the fourth quarter enter school shortly before
they turn 6, while those born in the first quarter enter school at
around age 6½. Furthermore, because compulsory schooling
laws typically require students to remain in school only until 
their 16th birthday, these groups of students will be in dif- 
ferent grades, or through a given grade to a different degree, 
when they reach the legal dropout age. The combination of 
school start-age policies and compulsory schooling laws cre- 
ates a natural experiment in which children are compelled to 
attend school for different lengths of time, depending on their 
birthdays. 

Angrist and Krueger looked at the relationship between 
educational attainment and quarter of birth using U.S. cen- 
sus data. Panel A of figure 4.1.1 (adapted from Angrist and 
Krueger, 1991) displays the education quarter-of-birth pattern 
for men in the 1980 census who were born in the 1930s. The 
figure clearly shows that men born earlier in the calendar year 
tended to have lower average schooling levels. Panel A of fig- 
ure 4.1.1 is a graphical depiction of the first stage. The first 
stage in a general IV framework is the regression of the causal 
variable of interest on covariates and instruments. The plot 
summarizes this regression because average schooling by year 
and quarter of birth is what you get for fitted values from a 
regression of schooling on a full set of year-of-birth dummies 
(covariates) and quarter-of-birth dummies (instruments). 

Panel B of figure 4.1.1 displays average earnings by quar- 
ter of birth for the same sample used to construct panel A. 
This panel illustrates the reduced-form relationship between 
the instruments and the dependent variable. The reduced form 
is the regression of the dependent variable on any covari- 
ates in the model and the instruments. Panel B shows that 
older cohorts tend to have higher earnings, because earnings 
rise with work experience. The figure also shows that men 
born in early quarters almost always earned less, on average, 
than those born later in the year, even after adjusting for 
year of birth, a covariate in the Angrist and Krueger (1991) 
setup. Importantly, this reduced-form relation parallels the 
Figure 4.1.1 Graphical depiction of the first stage and reduced
form for IV estimates of the economic return to schooling using 
quarter-of-birth instruments (from Angrist and Krueger, 1991). 


quarter-of-birth pattern in schooling, suggesting the two pat- 
terns are closely related. Because an individual’s date of birth 
is probably unrelated to his or her innate ability, motivation, 
or family connections, it seems credible to assert that the only 
reason for the up-and-down quarter-of-birth pattern in earn- 
ings is the up-and-down quarter-of-birth pattern in schooling. 
This is the critical assumption that drives the quarter-of-birth
IV story.⁴

A mathematical representation of the story told by figure 4.1.1
comes from the first-stage and reduced-form regression
equations, spelled out below:


$$s_i = X_i'\pi_{10} + \pi_{11} z_i + \xi_{1i} \tag{4.1.4a}$$

$$Y_i = X_i'\pi_{20} + \pi_{21} z_i + \xi_{2i}. \tag{4.1.4b}$$


The parameter $\pi_{11}$ in equation (4.1.4a) captures the first-stage
effect of $z_i$ on $s_i$, adjusting for covariates, $X_i$. The parameter
$\pi_{21}$ in equation (4.1.4b) captures the reduced-form effect
of $z_i$ on $Y_i$, adjusting for these same covariates. In Angrist
and Krueger (1991), the instrument $z_i$ is quarter of birth (or
a dummy indicating quarter of birth) and the covariates are
dummies for year of birth and state of birth. In the language
of the SEM, the dependent variables in these two equations are
said to be the endogenous variables (determined jointly within
the system), while the variables on the right-hand side are said
to be the exogenous variables (determined outside the system).
The instruments $z_i$ are a subset of the exogenous variables.
The exogenous variables that are not instruments are said to
be exogenous covariates. Although we’re not estimating a
traditional supply-and-demand system in this case, these SEM
variable labels are still widely used in empirical practice.

The covariate-adjusted IV estimator is the sample analog of
the ratio $\pi_{21}/\pi_{11}$. To see this, note that the denominators of the
reduced-form and first-stage coefficients are the same. Hence,
their ratio is


$$\rho = \frac{\pi_{21}}{\pi_{11}} = \frac{\mathrm{Cov}(Y_i, \tilde{z}_i)}{\mathrm{Cov}(s_i, \tilde{z}_i)}, \tag{4.1.5}$$


4Other explanations are possible, the most likely being some sort of family 
background effect associated with season of birth (see, e.g., Bound, Jaeger, and 
Baker, 1995). Weighing against the possibility of omitted family background 
effects is the fact that the quarter-of-birth pattern in average schooling is most 
pronounced at the schooling levels most affected by compulsory attendance 
laws. 
where $\tilde{z}_i$ is the residual from a regression of $z_i$ on the
exogenous covariates, $X_i$. The right-hand side of (4.1.5) therefore
swaps $\tilde{z}_i$ for $z_i$ in the IV formula, (4.1.3). Econometricians call
the sample analog of equation (4.1.5) an indirect least squares
(ILS) estimator of $\rho$ in the causal model with covariates,


$$Y_i = \alpha'X_i + \rho s_i + \eta_i, \tag{4.1.6}$$


where $\eta_i$ is the compound error term, $A_i'\gamma + v_i$. It’s easy to
use equation (4.1.6) to confirm directly that
$\mathrm{Cov}(Y_i, \tilde{z}_i) = \rho\,\mathrm{Cov}(s_i, \tilde{z}_i)$, since $\tilde{z}_i$ is uncorrelated
with $X_i$ by construction and with $\eta_i$ by assumption.
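
The equivalence between the ratio of covariate-adjusted coefficients and the covariance ratio in (4.1.5) can be checked numerically. The sketch below uses a made-up data-generating process (one continuous covariate plus a constant, an instrument correlated with that covariate, and an assumed $\rho$ of .10); the helper `ols` is our own thin wrapper around `numpy.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
rho = 0.10                                     # assumed causal effect

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])           # exogenous covariates (with constant)
z = 0.5 * x + rng.normal(size=n)               # instrument, correlated with covariates
eta = rng.normal(size=n)                       # structural error (includes "ability")
s = 1.0 + 0.8 * x + 0.6 * z + 0.7 * eta + rng.normal(size=n)
y = 2.0 - 0.5 * x + rho * s + eta

def ols(dep, W):
    """OLS coefficients from regressing dep on the columns of W."""
    return np.linalg.lstsq(W, dep, rcond=None)[0]

W = np.column_stack([X, z])
pi_11 = ols(s, W)[-1]                          # first-stage coefficient on z
pi_21 = ols(y, W)[-1]                          # reduced-form coefficient on z
ils = pi_21 / pi_11                            # indirect least squares

# Residualize the instrument on the covariates, then take the covariance ratio.
z_tilde = z - X @ ols(z, X)
cov_ratio = (y @ z_tilde) / (s @ z_tilde)

print(f"ILS: {ils:.4f}  covariance ratio with residualized z: {cov_ratio:.4f}")
```

Because `z_tilde` has mean zero by construction (the covariates include a constant), the two inner products are proportional to the sample covariances, so the two numbers coincide up to floating-point error.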


4.1.1 Two-Stage Least Squares 


The reduced-form equation, (4.1.4b), can be derived by substituting
the first-stage equation, (4.1.4a), into the causal relation
of interest, (4.1.6), which is also called a “structural equation”
in simultaneous equations language. We have:


$$\begin{aligned}
Y_i &= \alpha'X_i + \rho[X_i'\pi_{10} + \pi_{11} z_i + \xi_{1i}] + \eta_i \\
    &= X_i'[\alpha + \rho\pi_{10}] + \rho\pi_{11} z_i + [\rho\xi_{1i} + \eta_i] \\
    &= X_i'\pi_{20} + \pi_{21} z_i + \xi_{2i},
\end{aligned} \tag{4.1.7}$$


where $\pi_{20} \equiv \alpha + \rho\pi_{10}$, $\pi_{21} \equiv \rho\pi_{11}$, and $\xi_{2i} \equiv \rho\xi_{1i} + \eta_i$ in
equation (4.1.4b). Equation (4.1.7) again shows why $\rho = \pi_{21}/\pi_{11}$. Note
also that a slight rearrangement of (4.1.7) gives


$$Y_i = \alpha'X_i + \rho[X_i'\pi_{10} + \pi_{11} z_i] + \xi_{2i}, \tag{4.1.8}$$


where $[X_i'\pi_{10} + \pi_{11} z_i]$ is the population fitted value from the
first-stage regression of $s_i$ on $X_i$ and $z_i$. Because $z_i$ and $X_i$ are
uncorrelated with the reduced-form error, $\xi_{2i}$, the coefficient
on $[X_i'\pi_{10} + \pi_{11} z_i]$ in the population regression of $Y_i$ on $X_i$ and
$[X_i'\pi_{10} + \pi_{11} z_i]$ equals $\rho$.

In practice, of course, we almost always work with data
from samples. Given a random sample, the first-stage fitted
values are consistently estimated by


$$\hat{s}_i = X_i'\hat{\pi}_{10} + \hat{\pi}_{11} z_i,$$

where $\hat{\pi}_{10}$ and $\hat{\pi}_{11}$ are OLS estimates from equation (4.1.4a).
The coefficient on $\hat{s}_i$ in the regression of $Y_i$ on $X_i$ and $\hat{s}_i$ is called
the two-stage least squares (2SLS) estimator of $\rho$. In other
words, 2SLS estimates can be constructed by OLS estimation
of the “second-stage equation,”


$$Y_i = \alpha'X_i + \rho\hat{s}_i + [\eta_i + \rho(s_i - \hat{s}_i)]. \tag{4.1.9}$$


This is called 2SLS because it can be done in two steps, the
first estimating $\hat{s}_i$ using equation (4.1.4a) and the second
estimating equation (4.1.9). The resulting estimator is consistent
for $\rho$ because the covariates and first-stage fitted values are
uncorrelated with both $\eta_i$ and $(s_i - \hat{s}_i)$.

The 2SLS name notwithstanding, we don’t usually construct 
2SLS estimates in two steps. For one thing, the resulting stan- 
dard errors are wrong, as we discuss later. Typically, we let 
specialized software routines (such as are available in SAS or 
Stata) do the calculation for us. This gets the standard errors 
right and helps to avoid other mistakes (see section 4.6.1). 
Still, the fact that the 2SLS estimator can be computed by 
a sequence of OLS regressions is one way to remember why 
it works. Intuitively, conditional on covariates, 2SLS retains
only the variation in $s_i$ that is generated by quasi-experimental
variation, that is, generated by the instrument $z_i$.
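
The two-step construction can be sketched by hand, mainly to see why it reproduces ILS and why the second-step residuals are the wrong ones for inference. Everything in the simulated design below is an assumption made for illustration (covariates, coefficient values, and an assumed $\rho$ of .10). The sketch relies on the standard fact that 2SLS residuals are formed with the actual $s_i$, whereas a manual second stage forms them with $\hat{s}_i$; it does not compute standard errors, which are best left to a dedicated routine.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
rho = 0.10                                     # assumed causal effect

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])           # exogenous covariates (with constant)
z = 0.5 * x + rng.normal(size=n)               # single instrument
eta = rng.normal(size=n)                       # structural error
s = 1.0 + 0.8 * x + 0.6 * z + 0.7 * eta + rng.normal(size=n)
y = 2.0 - 0.5 * x + rho * s + eta

def ols(dep, W):
    """OLS coefficients from regressing dep on the columns of W."""
    return np.linalg.lstsq(W, dep, rcond=None)[0]

# Step 1: first-stage fitted values from (4.1.4a)
W = np.column_stack([X, z])
s_hat = W @ ols(s, W)

# Step 2: OLS of y on covariates and fitted values, as in (4.1.9)
second_stage = np.column_stack([X, s_hat])
coefs = ols(y, second_stage)
rho_2sls = coefs[-1]

# With one instrument, this equals the ILS ratio of reduced form to first stage.
rho_ils = ols(y, W)[-1] / ols(s, W)[-1]

# Residuals from the manual second stage use s_hat; 2SLS residuals use s.
resid_manual = y - second_stage @ coefs
resid_2sls = y - np.column_stack([X, s]) @ coefs

print(f"2SLS: {rho_2sls:.4f}  ILS: {rho_ils:.4f}")
print(f"residual variance, manual second stage: {resid_manual.var():.3f}")
print(f"residual variance, using actual s:      {resid_2sls.var():.3f}")
```

The two residual variances differ, which is the source of the incorrect standard errors from a hand-rolled second stage.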

2SLS is a many-splendored thing. For one, it is an IV estimator:
the 2SLS estimate of $\rho$ in (4.1.9) is the sample analog
of $\mathrm{Cov}(Y_i, \hat{s}_i^*)/\mathrm{Cov}(s_i, \hat{s}_i^*)$, where $\hat{s}_i^*$ is the residual from a
regression of $\hat{s}_i$ on $X_i$. This follows from the multivariate
regression anatomy formula and the fact that
$\mathrm{Cov}(s_i, \hat{s}_i^*) = V(\hat{s}_i^*)$. It is also easy
to show that, in a model with a single endogenous variable
and a single instrument, the 2SLS estimator is the same as the
corresponding ILS estimator.⁵


5Note that $\hat{s}_i^* = \tilde{z}_i\hat{\pi}_{11}$, where $\tilde{z}_i$ is the residual from a regression of $z_i$ on
$X_i$, so that the 2SLS estimator is the sample analog of
$[\mathrm{Cov}(Y_i, \tilde{z}_i)/V(\tilde{z}_i)](1/\hat{\pi}_{11})$. But
the sample analog of the numerator, $\mathrm{Cov}(Y_i, \tilde{z}_i)/V(\tilde{z}_i)$, is the OLS estimate of $\pi_{21}$
in the reduced form, (4.1.4b), while $\hat{\pi}_{11}$ is the OLS estimate of the first-stage
effect, $\pi_{11}$, in (4.1.4a). Hence, 2SLS with a single instrument is ILS, that is, the
ratio of the reduced-form effect of the instrument to the corresponding first-
stage effect where both the first-stage and reduced-form equations include
covariates.

The link between 2SLS and IV warrants a bit more elaboration
in the multi-instrument case. Assuming each instrument
captures the same causal effect (a strong assumption that is
relaxed below), we might want to combine these alternative
IV estimates into a single more precise estimate. In models
with multiple instruments, 2SLS accomplishes this by combining
multiple instruments into a single instrument. Suppose,
for example, we have three instrumental variables, $z_{1i}$, $z_{2i}$,
and $z_{3i}$. In the Angrist and Krueger (1991) application, these
are dummies for first-, second-, and third-quarter births. The
first-stage equation then becomes


$$s_i = X_i'\pi_{10} + \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} + \xi_{1i}, \tag{4.1.10a}$$


while the 2SLS second stage is the same as (4.1.9), except that
the fitted values are from (4.1.10a) instead of (4.1.4a). The
IV interpretation of this 2SLS estimator is the same as before:
the instrument is the residual from a regression of first-stage
fitted values on exogenous covariates. The exclusion restriction
in this case is the claim that the quarter-of-birth dummies
in (4.1.10a) are uncorrelated with $\eta_i$ in equation (4.1.6).
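
The sketch below illustrates the multi-instrument case with simulated data. The setup is hypothetical: three quarter-of-birth style dummies with made-up (and deliberately strong) first-stage effects, one binary covariate, and an assumed $\rho$ of .10. It shows that the 2SLS coefficient equals an IV estimate whose single instrument is the residual from regressing the first-stage fitted values on the covariates.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
rho = 0.10                                        # assumed causal effect

q = rng.integers(1, 5, size=n)                    # quarter of birth, 1..4
Z = np.column_stack([(q == k).astype(float) for k in (1, 2, 3)])   # three dummies
x = rng.integers(0, 2, size=n).astype(float)      # a hypothetical binary covariate
X = np.column_stack([np.ones(n), x])
eta = rng.normal(size=n)
s = (12.0 + 0.3 * x + Z @ np.array([-1.5, -1.0, -0.5])
     + 0.8 * eta + rng.normal(size=n))
y = 5.0 + 0.2 * x + rho * s + 0.5 * eta

def ols(dep, W):
    """OLS coefficients from regressing dep on the columns of W."""
    return np.linalg.lstsq(W, dep, rcond=None)[0]

# First stage with three instruments, as in (4.1.10a)
W = np.column_stack([X, Z])
s_hat = W @ ols(s, W)

# 2SLS: coefficient on the fitted values in the second stage
rho_2sls = ols(y, np.column_stack([X, s_hat]))[-1]

# IV interpretation: the combined instrument is s_hat residualized on X
s_hat_star = s_hat - X @ ols(s_hat, X)
rho_iv = (y @ s_hat_star) / (s @ s_hat_star)

print(f"2SLS: {rho_2sls:.4f}  IV with combined instrument: {rho_iv:.4f}")
```

The three dummies enter the second stage only through one linear combination, the fitted values, which is what "combining multiple instruments into a single instrument" means in practice.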

The results of 2SLS estimation of the economic returns to
schooling using quarter-of-birth dummies as instruments are
shown in table 4.1.1, which reports OLS and 2SLS estimates
of models similar to those estimated by Angrist and Krueger
(1991). Each column in the table contains an OLS or 2SLS
estimate of $\rho$ from an equation like (4.1.6), estimated with
different combinations of instruments and control variables.
The OLS estimate in column 1 is from a regression of log wages
on schooling with no control variables, while the OLS estimate
in column 2 is from a model adding dummies for year of birth
and state of birth as control variables. In both cases, the
estimated return to schooling is around .07.

The first pair of IV estimates, reported in columns 3 and
4, are from models without exogenous covariates. The instrument
used to construct the estimate in column 3 is a single
dummy for first-quarter births, while the instruments used to
construct the estimate in column 4 are three dummies indicating
first-, second-, and third-quarter births. These estimates
range from .10 to .13. The results from models including
year-of-birth and state-of-birth dummies as exogenous covariates

TABLE 4.1.1
2SLS estimates of the economic returns to schooling


                                          OLS                            2SLS
                                    (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)

Years of education                  .071     .067     .102     .13      .104     .108     .087     .057
                                   (.0004)  (.0004)  (.024)   (.020)   (.026)   (.020)   (.016)   (.029)

Exogenous covariates
  Age (in quarters)                                                                                 ✓
  Age (in quarters) squared                                                                         ✓
  9 year-of-birth dummies                    ✓                          ✓        ✓        ✓        ✓
  50 state-of-birth dummies                  ✓                          ✓        ✓        ✓        ✓

Instruments
  dummy for QOB = 1                                   ✓        ✓        ✓        ✓        ✓        ✓
  dummy for QOB = 2                                            ✓                 ✓        ✓        ✓
  dummy for QOB = 3                                            ✓                 ✓        ✓        ✓
  QOB dummies interacted with                                                             ✓        ✓
    year-of-birth dummies
    (30 instruments total)


Notes: The table reports OLS and 2SLS estimates of the returns to schooling using the Angrist and Krueger (1991)
1980 census sample. This sample includes native-born men, born 1930-39, with positive earnings and nonallocated
values for key variables. The sample size is 329,509. Robust standard errors are reported in parentheses. QOB denotes
quarter of birth.
(reported in columns 5 and 6) are similar, not surprisingly,
since quarter of birth is not closely related to either of these 
controls. Overall, the 2SLS estimates are mostly a bit larger 
than the corresponding OLS estimates. This suggests that 
the observed association between schooling and earnings is 
not driven by omitted variables such as ability and family 
background. 

Column 7 in table 4.1.1 shows the results of adding interaction
terms to the instrument list. In particular, this specification
adds three quarter-of-birth dummies interacted with
nine dummies for year of birth (the sample includes cohorts
born in 1930-39), for a total of 30 excluded instruments. The
first-stage equation becomes


$$\begin{aligned}
s_i = {} & X_i'\pi_{10} + \pi_{11} z_{1i} + \pi_{12} z_{2i} + \pi_{13} z_{3i} \\
         & + \sum_j (B_{ij} z_{1i})\kappa_{1j} + \sum_j (B_{ij} z_{2i})\kappa_{2j} + \sum_j (B_{ij} z_{3i})\kappa_{3j} + \xi_{1i},
\end{aligned} \tag{4.1.10b}$$


where $B_{ij}$ is a dummy equal to one if individual $i$ was born
in year $j$ for $j$ equal to 1931-39. The coefficients $\kappa_{1j}$, $\kappa_{2j}$, $\kappa_{3j}$
are the corresponding quarter and year interaction terms. The
rationale for adding these interaction terms is an increase in
precision that comes from increasing the first-stage $R^2$, which
goes up because the quarter-of-birth pattern in schooling differs
across cohorts. In this example, the addition of interaction
terms to the instrument list leads to a modest gain in precision;
the standard error declines from .019 to .016 as we move from
column 6 to column 7.⁶ (The first-stage and reduced-form
effects plotted in figure 4.1.1 are from this fully interacted
specification.)

The last 2SLS model reported in table 4.1.1 adds controls 
for linear and quadratic terms in age in quarters to the list 
of exogenous covariates. In other words, someone who was 
born in the first quarter of 1930 is recorded as being 50 years 
old on census day (April 1), 1980, while someone born in 
the fourth quarter is recorded as being 49.25 years old. This 


6This gain may not be without cost, as the use of many additional instruments
opens up the possibility of increased bias, an issue discussed in
section 4.6.4.
finely coded age variable provides a partial control for the
fact that small differences in age may be an omitted variable 
that confounds the quarter-of-birth identification strategy. As 
long as the effects of age are reasonably smooth, the quadratic 
age-in-quarters model will pick them up. 

Columns 7 and 8 in table 4.1.1 illustrate the interplay 
between identification and estimation. (In traditional SEM the- 
ory, a parameter is said to be identified if we can figure it out 
from the reduced form.) For the 2SLS procedure to work, there 
must be some variation in the first-stage fitted values condi- 
tional on whatever exogenous covariates are included in the 
model. If the first-stage fitted values are a linear combination 
of the included covariates, then the 2SLS estimate simply does 
not exist. In equation (4.1.9) this would be manifest by perfect
multicollinearity (i.e., linear dependence between $X_i$ and $\hat{s}_i$).
2SLS estimates with quadratic age controls exist, but the vari- 
ability “left over” in the first-stage fitted values is reduced 
when the covariates include variables such as age in quar- 
ters that are closely related to the instruments (quarter-of-birth 
dummies). Because this variability is the primary determinant 
of 2SLS standard errors, the estimate in column 8 is markedly 
less precise than that in column 7, though it is still close to the 
corresponding OLS estimate. 
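
The nonexistence case can be made concrete with a deliberately extreme, hypothetical design in which the covariates already contain the same quarter-of-birth dummies that serve as instruments. The first-stage fitted values are then an exact linear combination of the covariates, and the second-stage regressor matrix loses a rank. The numbers below are simulated and are not related to table 4.1.1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000

q = rng.integers(1, 5, size=n)                      # quarter of birth, 1..4
D = np.column_stack([(q == k).astype(float) for k in (1, 2, 3)])

X = np.column_stack([np.ones(n), D])                # covariates saturate quarter of birth
Z = D                                               # instruments: the same dummies
s = D @ np.array([-0.3, -0.2, -0.1]) + rng.normal(size=n)

# First stage on covariates and instruments (rank deficient by design;
# lstsq still returns the least-squares fitted values).
W = np.column_stack([X, Z])
s_hat = W @ np.linalg.lstsq(W, s, rcond=None)[0]

# Second-stage regressors: covariates plus fitted values
second_stage = np.column_stack([X, s_hat])
n_cols = second_stage.shape[1]
rank = np.linalg.matrix_rank(second_stage)

print(f"second-stage columns: {n_cols}, rank: {rank}")
```

A rank below the number of columns means the coefficient on the fitted values is not separately determined. The age-in-quarters controls in column 8 do not go this far, but they move in the same direction by absorbing part of the variation in the fitted values.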


Recap of IV and 2SLS Lingo 


As we’ve seen, the endogenous variables are the dependent 
variable and the independent variable(s) to be instrumented; 
in a simultaneous equations model, endogenous variables are 
determined by solving a system of stochastic linear equa- 
tions. To treat an independent variable as endogenous is to 
instrument it, in other words, to replace it with fitted values 
in the second stage of a 2SLS procedure. The independent 
endogenous variable in the Angrist and Krueger (1991) study 
is schooling. The exogenous variables include the exoge- 
nous covariates that are not instrumented and the instruments 
themselves. In a simultaneous equations model, exogenous 
variables are determined outside the system. The exogenous 
covariates in the Angrist and Krueger (1991) study are dummies
for year of birth and state of birth. We think of exogenous
covariates as controls. 2SLS aficionados live in a world of 
mutually exclusive labels: in any empirical study involving IV, 
the random variables to be studied are either dependent 
variables, independent endogenous variables, instrumental 
variables, or exogenous covariates. Sometimes we shorten this 
to dependent and endogenous variables, instruments, and 
covariates (fudging the fact that the dependent variable is also 
endogenous in a traditional SEM). 


4.1.2 The Wald Estimator 


The simplest IV estimator uses a single dummy instrument
to estimate a model with one endogenous regressor and no
covariates. Without covariates, the causal regression model is


$$Y_i = \alpha + \rho s_i + \eta_i, \tag{4.1.11}$$


where $\eta_i$ and $s_i$ may be correlated. Given the further
simplification that $z_i$ is a dummy variable that equals one with
probability $p$, we can easily show that


$$\mathrm{Cov}(Y_i, z_i) = \{E[Y_i \mid z_i = 1] - E[Y_i \mid z_i = 0]\}\,p(1 - p),$$

with an analogous formula for $\mathrm{Cov}(s_i, z_i)$. It therefore follows
that


$$\rho = \frac{E[Y_i \mid z_i = 1] - E[Y_i \mid z_i = 0]}{E[s_i \mid z_i = 1] - E[s_i \mid z_i = 0]}. \tag{4.1.12}$$


A direct route to this result uses (4.1.11) and the fact that
$E[\eta_i \mid z_i] = 0$, so we have


$$E[Y_i \mid z_i] = \alpha + \rho E[s_i \mid z_i]. \tag{4.1.13}$$

Solving this equation for $\rho$ produces (4.1.12).
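
Both steps above hold exactly in a sample when sample moments replace population moments, which the following sketch verifies. The simulated design (a binary instrument with probability .3, an assumed $\rho$ of .10, and an unobserved term that enters both schooling and the outcome) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
rho = 0.10                                     # assumed causal effect

z = rng.binomial(1, 0.3, size=n)               # dummy instrument
eta = rng.normal(size=n)                       # unobserved, independent of z
s = 12.0 + 0.5 * z + eta + rng.normal(size=n)
y = 1.0 + rho * s + 0.4 * eta

# Differences in means by instrument status
dy = y[z == 1].mean() - y[z == 0].mean()       # reduced form
ds = s[z == 1].mean() - s[z == 0].mean()       # first stage
wald = dy / ds                                 # sample analog of (4.1.12)

# Covariance identity: Cov(Y, z) = (difference in means) * p * (1 - p)
p = z.mean()
cov_yz = np.cov(y, z, ddof=0)[0, 1]            # ddof=0 so the identity is exact
identity_rhs = dy * p * (1 - p)

# IV as a ratio of covariances, as in (4.1.3)
iv = np.cov(y, z)[0, 1] / np.cov(s, z)[0, 1]

print(f"Wald: {wald:.4f}  IV covariance ratio: {iv:.4f}")
print(f"Cov(Y, z): {cov_yz:.6f}  difference in means times p(1-p): {identity_rhs:.6f}")
```

The Wald ratio and the covariance ratio agree to floating-point precision because the $p(1 - p)$ factors in the numerator and denominator cancel.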


Equation (4.1.12) is the population analog of the landmark
Wald estimator for a bivariate regression with mismeasured
regressors.⁷ In our context, the Wald formula provides an
appealingly transparent implementation of the IV strategy for
the elimination of OVB. The principal claim that motivates IV
estimation of causal effects is that the only reason for any
relation between the dependent variable and the instrument is the
effect of the instrument on the causal variable of interest. In
the context of a dummy instrument, it therefore seems natural
to divide (or rescale) the reduced-form difference in means
by the corresponding first-stage difference in means.

The Angrist and Krueger (1991) study using quarter of 
birth to estimate the economic returns to schooling shows the 
Wald estimator in action. Table 4.1.2 displays the ingredients 
behind a Wald estimate constructed using the 1980 census. 
The difference in earnings between men born in the first and 
fourth quarters of the year is —.0135, while the corresponding 
difference in schooling is —.151. The ratio of these two differ- 
ences is a Wald estimate of the economic value of schooling in 
per-year terms. This comes out to be .089. Not surprisingly, 
this estimate is not too different from the 2SLS estimates in 
table 4.1.1. The reason we should expect the Wald and 2SLS 
estimates to be similar is that both are constructed from the 
same information: differences in earnings by season of birth. 
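
For concreteness, the arithmetic behind the Wald estimate quoted above, using only the two differences in means reported in the text, is

$$\hat{\rho}_{Wald} = \frac{-.0135}{-.151} \approx .089.$$

The hat and the subscript are our notation for the sample estimate; the text does not name it.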

The Angrist (1990) study of the effects of Vietnam-era mil- 
itary service on the earnings of veterans also shows the Wald 
estimator in action. In the 1960s and early 1970s, young 
American men were at risk of being drafted for military service. 
Concerns about the fairness of the U.S. conscription policy led 
to the institution of a draft lottery in 1970 that was used to 
determine priority for conscription. A promising instrument 
for Vietnam veteran status is therefore draft eligibility, since 
this was determined by a lottery over birthdays. Specifically, 


7 As noted in the introduction to this chapter, measurement error in regres- 
sors tends to shrink regression coefficients toward zero. To eliminate this bias, 
Wald (1940) suggested that the data be divided in a manner independent of 
the measurement error, and the coefficient of interest estimated as a ratio 
of differences in means, as in (4.1.12). Durbin (1954) showed that Wald’s 
method of fitting straight lines is an IV estimator where the instrument is a 
dummy marking Wald’s division of the data. Hausman (2001) provides an 
overview of econometric strategies for dealing with measurement error.

TABLE 4.1.2
Wald estimates of the returns to schooling using
quarter-of-birth instruments

                             (1)            (2)            (3)
                           Born in        Born in       Difference
                         1st Quarter    4th Quarter    (Std. Error)
                           of Year        of Year        (1) - (2)
ln (weekly wage)            5.892          5.905          -.0135
                                                          (.0034)
Years of education         12.688         12.839          -.151
                                                          (.016)
Wald estimate of                                           .089
return to education                                       (.021)
OLS estimate of                                            .070
return to education                                       (.0005)

Notes: From Angrist and Imbens (1995). The sample includes native-born
men with positive earnings from the 1930-39 birth cohorts in the
1980 census 5 percent file. The sample size is 162,515.


in each year from 1970 to 1972, random sequence numbers 
(RSNs) were randomly assigned to each birth date in cohorts 
of 19-year-olds. Men with lottery numbers below a cutoff were 
eligible for the draft, while men with numbers above the cutoff 
could not be drafted. In practice, many draft-eligible men were 
still exempted from service for health or other reasons, while 
many men who were draft-exempt nevertheless volunteered 
for service. So veteran status was not completely determined 
by randomized draft eligibility, but draft eligibility provides a 
dummy instrument highly correlated with Vietnam-era veteran 
status. 

Among white men who were at risk of being drafted in 
the 1970 draft lottery, draft eligibility is clearly associated 
with lower earnings in the years after the lottery. This is 
documented in table 4.1.3, which reports the effect of random- 
ized draft eligibility status on Social Security-taxable earnings 
in column 2. Column 1 shows average annual earnings for 
purposes of comparison. For men born in 1950, there are

TABLE 4.1.3
Wald estimates of the effects of military service on the earnings of white
men born in 1950

                 Earnings                Veteran Status
           ----------------------    ----------------------   Wald Estimate
Earnings             Eligibility               Eligibility      of Veteran
Year        Mean       Effect         Mean       Effect           Effect
            (1)         (2)           (3)         (4)              (5)
1981       16,461      -435.8         .267        .159           -2,741
                       (210.5)                   (.040)          (1,324)
1971        3,338      -325.9                                    -2,050
                        (46.6)                                     (293)
1969        2,299        -2.0
                        (34.5)

Notes: Adapted from Angrist (1990), tables 2 and 3. Standard errors are shown in
parentheses. Earnings data are from Social Security administrative records. Figures
are in nominal dollars. Veteran status data are from the Survey of Income and
Program Participation. There are about 13,500 individuals in the sample.


significant negative effects of eligibility status on earnings in 
1971, when these men were mostly just beginning their mil- 
itary service, and, perhaps more surprisingly, in 1981, ten 
years later. In contrast, there is no evidence of an association 
between draft eligibility status and earnings in 1969, the year 
the lottery drawing for men born in 1950 was held but before 
anyone born in 1950 was actually drafted. 

Because eligibility status was randomly assigned, the claim 
that the estimates in column 2 represent the causal effect of
draft eligibility on earnings seems uncontroversial. The infor- 
mation required to go from draft eligibility effects to veteran 
status effects is the denominator of the Wald estimator, which 
is the effect of draft eligibility on the probability of serving 
in the military. This information is reported in column 4 of 
table 4.1.3, which shows that draft-eligible men were almost 
16 percentage points more likely to have served in the Vietnam 
era. The Wald estimate of the effect of military service on 1981 
earnings, reported in column 5, amounts to about 15 percent
of the mean. Effects were even larger in 1971 (in percentage 
terms), when affected soldiers were still in the army.

An important feature of the Wald/IV estimator is that the
identifying assumptions are easy to assess and interpret. Let $D_i$
denote Vietnam-era veteran status and $z_i$ indicate draft eligibility.
The fundamental claim justifying our interpretation of the
Wald estimator as capturing the causal effect of $D_i$ is that the
only reason why $E[Y_i \mid z_i]$ changes as $z_i$ changes is the variation
in $E[D_i \mid z_i]$. A simple check on this is to look for an association
between $z_i$ and personal characteristics that should not be
affected by $D_i$, for example race, sex, or any other characteristic
that was determined before $D_i$ was determined. Another
useful check is to look for an association between the instrument
and outcomes in samples where there is no relationship
between $D_i$ and $z_i$. If the only reason for draft eligibility effects
on earnings is veteran status, then draft eligibility effects on
earnings should be zero in samples where draft eligibility status
is unrelated to veteran status.

This idea is illustrated in Angrist’s (1990) study of the draft 
lottery by looking at 1969 earnings, an estimate repeated in the 
last row of table 4.1.3. It’s comforting that the draft eligibility 
treatment effect on 1969 earnings is zero, since 1969 earnings 
predate the 1970 draft lottery. A second variation on this idea 
looks at the cohort of men born in 1953. Although there was a 
lottery drawing that assigned RSNs to the 1953 birth cohort in 
February 1972, no one born in 1953 was actually drafted (the 
draft officially ended in July 1973). The first-stage relation- 
ship between draft eligibility and veteran status for men born 
in 1953 (defined using the 1952 lottery cutoff of 95) there- 
fore shows only a small difference in the probability of serving 
by eligibility status. There is also no significant relationship 
between earnings and draft eligibility status for men born in 
1953, a result that supports the claim that the only reason for 
draft eligibility effects is military service. 
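
The Wald arithmetic and the 1969 check can both be reproduced from the entries of table 4.1.3. The sketch below is illustrative only: the variable names are ours, and the t-ratios are simple estimate-to-standard-error ratios computed from the reported figures.

```python
# Column 2 of table 4.1.3: effect of draft eligibility on earnings (estimate, std. error).
reduced_form = {1981: (-435.8, 210.5), 1971: (-325.9, 46.6), 1969: (-2.0, 34.5)}
# Column 4: effect of draft eligibility on the probability of serving.
first_stage = 0.159

# Wald estimates for the years in which eligible men could have served.
wald = {year: reduced_form[year][0] / first_stage for year in (1981, 1971)}

# Reduced-form t-ratios; 1969 is the placebo year, before anyone born in 1950 was drafted.
t_ratio = {year: est / se for year, (est, se) in reduced_form.items()}

for year in (1981, 1971):
    print(year, round(wald[year]))          # 1981 -2741, then 1971 -2050
print("1969 reduced-form t-ratio:", round(t_ratio[1969], 2))   # -0.06
```

The 1981 and 1971 ratios match column 5 of the table, while the 1969 reduced form is a small fraction of its standard error.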

We conclude the discussion of Wald estimators with a set of 
IV estimates of the effect of family size on mothers’ employ- 
ment and work. Like the schooling and military service studies, 
these estimates are used for illustration elsewhere in the book. 
The relationship between fertility and labor supply has long 
been of interest to labor economists, while the case for omitted 
variables bias in this context is clear: mothers with weak labor
force attachment or low earnings potential may be more likely
to have children than mothers with strong labor force attach- 
ment or high earnings potential. This makes the observed 
association between family size and employment hard to inter- 
pret, since mothers who have big families probably would have 
worked less anyway. Angrist and Evans (1998) solve this omit- 
ted variables problem using two instrumental variables, both 
of which lend themselves to Wald-type estimation strategies. 

The first Wald estimator uses multiple births, an identifica- 
tion strategy for the effects of family size pioneered by Rosen- 
zweig and Wolpin (1980). The twins instrument in Angrist 
and Evans (1998) is a dummy for a multiple second birth in a 
sample of mothers with at least two children. The twins first- 
stage is .625, an estimate reported in column 3 of table 4.1.4. 
This means that 37.5 percent of mothers with two or more 
children would have had a third birth anyway; a multiple
second birth increases this proportion to 1. The twins instrument
rests on the idea that the occurrence of a multiple birth is
essentially random, unrelated to potential outcomes or family 
background. 

The second Wald estimator in table 4.1.4 uses sibling sex 
composition, an instrument motivated by the fact that Amer- 
ican parents with two children are much more likely to have 
a third child if the first two are of the same sex than if the 
sex composition is mixed. This is illustrated in column 5 of 
table 4.1.4, which shows that parents of same-sex siblings
are 6.7 percentage points more likely to have a third birth (the 
probability of a third birth among parents with a mixed-sex 
sibship is .38). The same-sex instrument is based on the claim 
that sibling sex composition is essentially random and affects 
family labor supply solely by increasing fertility. 

Twins and sex composition instruments both suggest that 
the birth of a third child has a large effect on employment rates 
and on weeks and hours worked. Wald estimates using twins 
instruments show a precisely estimated employment reduction 
of about .08, while weeks worked fall by 3.8 and hours per 
week fall by 3.4. These results, which appear in column 4 
of table 4.1.4, are smaller in absolute value than the corresponding
OLS estimates reported in column 2. This suggests

TABLE 4.1.4
Wald estimates of the effects of family size on labor supply

                                             IV Estimates Using
                                   ----------------------------------------
                                        Twins             Sex Composition
                                   First     Wald         First     Wald
Dependent       Mean     OLS       Stage   Estimates      Stage   Estimates
Variable         (1)     (2)        (3)       (4)          (5)       (6)
Employment      .528    -.167      .625     -.083         .067     -.135
                        (.002)    (.011)    (.017)       (.002)    (.029)
Weeks worked    19.0    -8.05               -3.83                  -6.23
                         (.09)               (.76)                 (1.29)
Hours/week      16.7    -6.02               -3.39                  -5.54
                         (.08)               (.64)                 (1.08)

Note: The table reports OLS and Wald estimates of the effects of a third birth on labor
supply using twins and sex composition instruments. Data are from the Angrist and
Evans (1998) extract including married women aged 21-35 with at least two children
in the 1980 census. OLS models include controls for mother's age, age at first birth,
dummies for the sex of first and second births, and dummies for race. The first stage is
the same for all dependent variables.


the latter are exaggerated by selection bias. Interestingly, the 
Wald estimates constructed using a same-sex dummy, reported 
in column 6, are larger than the twins estimates (showing an 
employment reduction of .135, for example). The juxtaposi- 
tion of twins and sex composition instruments in table 4.1.4 
suggests that different instruments need not generate similar 
estimates of causal effects even if both are valid. We expand 
on this important point in section 4.4. For now, however, we 
stick with a constant effects framework. 


4.1.3 Grouped Data and 2SLS 


The Wald estimator is the mother of all IV estimators because 
more complicated 2SLS estimators can typically be constructed 
from an underlying set of Wald estimators. The link between 
Wald and 2SLS is grouped data: 2SLS using dummy instruments
is the same thing as GLS on a set of group means. GLS
in turn can be understood as a linear combination of all the
Wald estimators that can be constructed from pairs of means. 
The generality of this link might appear to be limited by the 
presumption that the instruments at hand are dummies. Not 
all instrumental variables are dummies, or even discrete, but 
this is not really important. For one thing, many instruments 
can be thought of as defining categories, such as quarter of 
birth. Moreover, instrumental variables that appear more con- 
tinuous (such as draft lottery numbers, which range from 1 to 
365) can usually be grouped without much loss of information 
(e.g., a single dummy for draft eligibility status, or dummies 
for groups of 25 lottery numbers).⁸

To explain the Wald-grouping-2SLS nexus more fully, we 
stick with the draft lottery study. Earlier we noted that draft 
eligibility is a promising instrument for Vietnam-era veteran 
status. The draft eligibility ceilings were RSN 195 for men 
born in 1950, RSN 125 for men born in 1951, and RSN 
95 for men born in 1952. In practice, however, there is a 
richer link between draft lottery numbers (which we'll call $R_i$,
short for RSN) and veteran status ($D_i$) than draft eligibility
status alone. Although men with numbers above the eligibility 
ceiling were not drafted, the ceiling was unknown in advance. 
Some men therefore volunteered in the hope of serving under 
better terms and gaining some control over the timing of their 
service. The pressure to become a draft-induced volunteer was 
high for men with low lottery numbers but low for men with 
high numbers. As a result, there is variation in $P[D_i = 1 \mid R_i]$
even for values strictly above or below the draft eligibility cut- 
off. For example, men born in 1950 with lottery numbers 
200-225 were more likely to serve than those with lottery 
numbers 226-250, though ultimately no one in either group 
was drafted. 

The Wald estimator using draft eligibility as an instrument
for men born in 1950 compares the earnings of men with $R_i \le 195$
to the earnings of men with $R_i > 195$. But the previous


8 An exception is the classical measurement error model, where both the 
variable to be instrumented and the instrument are assumed to be continuous. 
Here, we have in mind IV scenarios involving OVB.

discussion suggests the possibility of many more comparisons,
for example men with $R_i \le 25$ versus men with $R_i \in [26, 50]$,
men with $R_i \in [51, 75]$ versus men with $R_i \in [76, 100]$, and
so on, until these 25-number intervals are exhausted. We
might also make the intervals finer, comparing, say, men in 
five-number or single-number intervals instead of 25-number 
intervals. The result of this expansion in the set of compar- 
isons is a set of Wald estimators. These sets are complete in 
that the intervals partition the support of the underlying instru- 
ment, while the individual estimators are linearly independent 
in the sense that their numerators are linearly independent. 
Finally, each of these Wald estimators consistently estimates 
the same causal effect, assumed here to be constant, as long 
as $R_i$ is independent of potential outcomes and correlated
with veteran status (i.e., the Wald denominators are not 
zero). 

The possibility of constructing multiple Wald estimators for 
the same causal effect naturally raises the question of what 
to do with all of them. We would like to come up with a 
single estimate that somehow combines the information in the 
individual Wald estimates efficiently. As it turns out, the most 
efficient linear combination of a full set of linearly independent 
Wald estimates is produced by fitting a line through the group 
means used to construct these estimates. 

The grouped data estimator can be motivated directly as
follows. As in (4.1.11), we work with a bivariate constant
effects model, which in this case can be written

$$Y_i = \alpha + \rho D_i + \eta_i, \qquad (4.1.14)$$

where $\rho = Y_{1i} - Y_{0i}$ is the causal effect of interest and $Y_{0i} =
\alpha + \eta_i$. Because $R_i$ was randomly assigned and lottery numbers
are assumed to have no effect on earnings other than through
veteran status, $E[\eta_i \mid R_i] = 0$. It therefore follows that

$$E[Y_i \mid R_i] = \alpha + \rho P[D_i = 1 \mid R_i], \qquad (4.1.15)$$

since $P[D_i = 1 \mid R_i] = E[D_i \mid R_i]$. In other words, the slope of the
line connecting average earnings given lottery number with
the average probability of service by lottery number is equal
to the effect of military service, $\rho$. This is in spite of the fact that
the regression of $Y_i$ on $D_i$ (in this case, the difference in means
by veteran status) almost certainly differs from $\rho$, since $Y_{0i}$
and $D_i$ are likely to be correlated.

Equation (4.1.15) suggests we estimate $\rho$ by fitting a line to
the sample analog of $E[Y_i \mid R_i]$ and $P[D_i = 1 \mid R_i]$. Suppose that
$R_i$ takes on values $j = 1, \ldots, J$. In principle, $j$ might run from 1
to 365, but in Angrist (1990), lottery number information was
aggregated to 69 five-number intervals, plus a 70th interval for
numbers 346-365. We can therefore think of $R_i$ as running
from 1 to 70. Let $\bar{y}_j$ and $\hat{p}_j$ denote estimates of $E[Y_i \mid R_i = j]$
and $P[D_i = 1 \mid R_i = j]$, while $\bar{\eta}_j$ denotes the average error
in (4.1.14). Because sample moments converge to population
moments, it follows that OLS estimates of $\rho$ in the grouped
equation

$$\bar{y}_j = \alpha + \rho \hat{p}_j + \bar{\eta}_j \qquad (4.1.16)$$

are consistent. In practice, however, generalized least squares
(GLS) may be preferable, since a grouped equation is
heteroskedastic with a known variance structure. The efficient
GLS estimator for grouped data in a constant effects linear
model is WLS, weighted by the inverse of the variance of $\bar{\eta}_j$ (see, e.g., Prais
and Aitchison, 1954, or Wooldridge, 2006). Assuming the
microdata residual is homoskedastic with variance $\sigma^2_\eta$, this
variance is $\sigma^2_\eta / n_j$, where $n_j$ is the group size. Therefore, we should
weight by the group size, as discussed in a different context in
section 3.4.1.

The GLS (or WLS) estimator of p in equation (4.1.16) is 
especially important for two reasons. First, the GLS slope 
estimate constructed from $J$ grouped observations is an asymptotically
efficient linear combination of any full set of $J - 1$
linearly independent Wald estimators (Angrist, 1991). This 
can be seen without any mathematics: GLS and any linear com- 
bination of Wald estimators are both linear combinations of 
the grouped dependent variable. Moreover, GLS is the asymp- 
totically efficient linear estimator for grouped data. Therefore 
we can conclude that there is no better (i.e., asymptotically
more efficient) linear combination of Wald estimators than
GLS (again, a maintained assumption here is that $\rho$ is constant).
The formula for constructing the GLS estimator from
a full set of linearly independent Wald estimators appears in 
Angrist (1988). 

Second, just as each Wald estimator is also an IV estimator, 
the GLS estimator of equation (4.1.16) is 2SLS. The instru- 
ments in this case are a full set of dummies to indicate each 
lottery number cell. To see why, define the set of dummy
instruments $Z_i \equiv \{r_{ji} = 1[R_i = j];\ j = 1, \ldots, J - 1\}$, where $1[\cdot]$
denotes the indicator function used to construct dummy variables.
Now, consider the first-stage regression of $D_i$ on $Z_i$
plus a constant. Since this first stage is saturated, the fitted
values will be the sample conditional means, $\hat{p}_j$, repeated $n_j$
times for each $j$. The second-stage slope estimate is therefore
the same as the slope from WLS estimation of the grouped
equation, (4.1.16), weighted by the cell size, $n_j$.
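
The equivalence between 2SLS with a full set of cell dummies and WLS on cell means can be verified numerically. The sketch below uses simulated data, not the draft lottery sample: the number of cells, the sample size, the parameter values, and the way service depends on the lottery cell are all assumptions made for illustration.

```python
import random

random.seed(0)

J, N = 10, 5000            # hypothetical: 10 lottery-number cells, 5,000 men
ALPHA, RHO = 10.0, -2.0    # hypothetical intercept and constant causal effect

cells, served, earnings = [], [], []
for _ in range(N):
    j = random.randrange(J)                    # cell is randomly assigned
    u = random.random()                        # unobserved trait, independent of the cell
    d = 1 if u < 0.46 - 0.04 * j else 0        # service is likelier in low-numbered cells
    eta = random.gauss(0, 1) + 2 * (u - 0.5)   # error correlated with d, not with j
    cells.append(j)
    served.append(d)
    earnings.append(ALPHA + RHO * d + eta)


def slope(x, y, w=None):
    """(Weighted) least squares slope of y on x with a constant."""
    w = w or [1.0] * len(x)
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return num / den


# Cell sizes and cell means (the saturated first stage).
n = [cells.count(j) for j in range(J)]
p_hat = [sum(d for c, d in zip(cells, served) if c == j) / n[j] for j in range(J)]
y_bar = [sum(y for c, y in zip(cells, earnings) if c == j) / n[j] for j in range(J)]

# 2SLS by hand: regress earnings on first-stage fitted values in the microdata.
fitted = [p_hat[c] for c in cells]
rho_2sls = slope(fitted, earnings)

# Grouped-data route: WLS on the J cell means, weighted by cell size.
rho_wls = slope(p_hat, y_bar, w=n)

# Naive difference in means by veteran status, for contrast.
rho_ols = slope(served, earnings)

print(f"2SLS {rho_2sls:.4f}  grouped WLS {rho_wls:.4f}  OLS {rho_ols:.4f}")
```

The first two numbers agree up to floating-point error, which is the algebraic point of the paragraph above; the OLS figure differs because the simulated error is correlated with service.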

The connection between grouped data and 2SLS is of both 
conceptual and practical importance. On the conceptual side, 
any 2SLS estimator using a set of dummy instruments can be 
understood as a linear combination of all the Wald estimators 
generated by using these instruments one at a time. The Wald 
estimator in turn provides a simple framework used later in 
this chapter to interpret IV estimates in the more realistic world 
of heterogeneous potential outcomes. 

Although not all instruments are inherently discrete and 
therefore immediately amenable to a Wald or grouped data 
interpretation, many are. Examples include the draft lottery 
number, quarter of birth, twins, and sibling sex composi- 
tion instruments we’ve already discussed. (See also the recent 
studies by Bennedsen et al., 2007, and Ananat and Michaels, 
2008, both of which use dummies for male first births as 
instruments.) Moreover, instruments that have a continuous 
flavor can often be fruitfully turned into discrete variables. For 
example, Angrist, Graddy, and Imbens (2000) recode contin- 
uous weather-based instruments into three dummy variables, 
stormy, mixed, and clear, which they then use to estimate 
the demand for fish. This dummy variable parameterization
seems to capture the main features of the relationship between
weather conditions and the price of fish.⁹

On the practical side, the grouped data equivalent of 2SLS 
gives us a simple tool that can be used to explain and evaluate 
any IV strategy. In the case of the draft lottery, for example, 
the grouped model embodies the assumption that the only rea- 
son average earnings vary with lottery numbers is the variation 
in probability of military service across lottery number groups. 
If the underlying causal relation is linear with constant effects, 
then equation (4.1.16) should fit the group means well, some- 
thing we can assess by inspection and, as discussed in the next 
section, with the machinery of formal statistical inference. 

Sometimes labor economists refer to grouped data plots for 
discrete instruments as visual instrumental variables (VIV).¹⁰
An example appears in Angrist (1990), reproduced here as fig- 
ure 4.1.2. This figure shows the relationship between average 
earnings in five-number RSN cells and the probability of ser- 
vice in these cells, for the 1981-84 earnings of white men born 
in 1950-53. The slope of the line through these points is an 
IV estimate of the earnings loss due to military service, in this 
case about $2,400, not very different from the Wald estimates 
discussed earlier but with a lower standard error (in this case, 
about $800). 


4.2 Asymptotic 2SLS Inference 


4.2.1 The Asymptotic Distribution 
of the 2SLS Coefficient Vector 


We can derive the limiting distribution of the 2SLS coefficient 
vector using an argument similar to that used in section 3.1.3 
for OLS. In this case, let $V_i \equiv [X_i' \ \hat{s}_i]'$ denote the vector of
regressors in the 2SLS second stage, equation (4.1.9). The 2SLS


9 Continuous instruments recoded as dummies can be seen as providing
a parsimonious nonparametric model for the underlying first-stage relation,
$E[D_i \mid z_i]$. In homoskedastic models with constant coefficients, $E[D_i \mid z_i]$ is the
asymptotically efficient instrument (Newey, 1990).

10 See, for example, the preface to Borjas (2005).

Figure 4.1.2 The relationship between average earnings and the
probability of military service (from Angrist, 1990). This is a VIV
plot of average 1981-84 earnings by cohort and five-RSN lottery
number group against conditional probabilities of veteran status in
the same cells. The sample includes white men born in 1950-53.
Plotted points consist of average residuals (over four years of
earnings) from regressions on period and cohort effects. The slope
of the least squares regression line drawn through the points is
-2,384, with a standard error of 778.


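estimator can then be written

$$\hat{\Gamma}_{2SLS} \equiv \left[ \sum_i V_i V_i' \right]^{-1} \sum_i V_i Y_i,$$

where $\Gamma \equiv [\alpha' \ \rho]'$ is the corresponding coefficient vector. Note
that

$$\begin{aligned}
\hat{\Gamma}_{2SLS} &= \Gamma + \left[ \sum_i V_i V_i' \right]^{-1} \sum_i V_i [\eta_i + \rho(s_i - \hat{s}_i)] \\
&= \Gamma + \left[ \sum_i V_i V_i' \right]^{-1} \sum_i V_i \eta_i, \qquad (4.2.1)
\end{aligned}$$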

where the second equality comes from the fact that the first-stage
residuals, $s_i - \hat{s}_i$, are orthogonal to $V_i$ in the sample. The
asymptotic distribution of the 2SLS coefficient vector is therefore
the asymptotic distribution of $[\sum_i V_i V_i']^{-1} \sum_i V_i \eta_i$. This
quantity is a little harder to work with than the corresponding
OLS quantity, because the regressors in this case involve
estimated fitted values, $\hat{s}_i$. A Slutsky-type argument shows,
however, that we get the same limiting distribution replacing
estimated fitted values with the corresponding population fitted
values (i.e., replacing $\hat{s}_i$ with $[X_i' \pi_{10} + \pi_{11} z_i]$). It therefore
follows that $\hat{\Gamma}_{2SLS}$ has an asymptotically normal distribution,
with probability limit $\Gamma$, and a covariance matrix estimated
consistently by $[\sum_i V_i V_i']^{-1} [\sum_i V_i V_i' \eta_i^2] [\sum_i V_i V_i']^{-1}$. This is a
sandwich formula like the one for OLS standard errors (White,
1982). Much as with OLS, if $\eta_i$ is conditionally homoskedastic
given covariates and instruments, the consistent covariance
matrix estimator simplifies to $[\sum_i V_i V_i']^{-1} \sigma^2_\eta$.

There is little new here, but there is one tricky point. It seems 
natural to construct 2SLS estimates manually by estimating 
the first stage (4.1.4a) and then plugging the fitted values into 
equation (4.1.9) and estimating this by OLS. That’s fine as far 
as the coefficient estimates go, but the resulting standard errors 
are wrong. Conventional regression software does not know 
that you are trying to construct a 2SLS estimate. When con- 
structing standard errors, the software computes the residual 
variance of the equation you estimate by OLS in the manual 
second stage: 


$$Y_i - [\alpha' X_i + \rho \hat{s}_i] = [\eta_i + \rho(s_i - \hat{s}_i)],$$

replacing the coefficients $\alpha$ and $\rho$ with the corresponding
second-stage estimates. The correct residual variance estimator,
however, uses the original endogenous regressor to
construct residuals and not the first-stage fitted values, $\hat{s}_i$.
In other words, the residual you want is the estimated $Y_i -
[\alpha' X_i + \rho s_i] = \eta_i$, so as to consistently estimate $\sigma^2_\eta$, and not
the variance of $\eta_i + \rho(s_i - \hat{s}_i)$. Although this problem is easy to
fix (you can construct the appropriate residual variance esti- 
mator in a separate calculation), software designed for 2SLS 
gets this right automatically, and may help you avoid other 
common 2SLS mistakes. 
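
The difference between the two residual variances is easy to see in a small simulation. The sketch below is a toy example with a single dummy instrument and no covariates; the sample size, coefficients, and error structure are assumptions chosen for illustration, and the helper names are ours.

```python
import math
import random

random.seed(1)

N = 2000
z, s, y = [], [], []
for _ in range(N):
    zi = random.randint(0, 1)            # dummy instrument
    v = random.gauss(0, 1)               # first-stage error
    eta = 0.5 * v + random.gauss(0, 1)   # structural error, correlated with v
    si = 12.0 + 1.0 * zi + v             # endogenous regressor
    z.append(zi)
    s.append(si)
    y.append(1.0 + 1.0 * si + eta)       # hypothetical constant effect rho = 1


def ols(x, t):
    """Intercept and slope from a bivariate regression of t on x."""
    x_bar, t_bar = sum(x) / len(x), sum(t) / len(t)
    b = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return t_bar - b * x_bar, b


# First stage, then the manual second stage on fitted values.
pi0, pi1 = ols(z, s)
s_hat = [pi0 + pi1 * zi for zi in z]
alpha_hat, rho_hat = ols(s_hat, y)

s_hat_bar = sum(s_hat) / N
ssx = sum((f - s_hat_bar) ** 2 for f in s_hat)


def std_error(regressor):
    """Slope standard error using residuals built from the given regressor."""
    resid = [yi - alpha_hat - rho_hat * ri for yi, ri in zip(y, regressor)]
    sigma2 = sum(e * e for e in resid) / (N - 2)
    return math.sqrt(sigma2 / ssx)


se_manual = std_error(s_hat)   # what OLS software reports for the manual second stage
se_correct = std_error(s)      # residuals use the original endogenous regressor

print(f"rho_hat {rho_hat:.3f}  manual SE {se_manual:.3f}  correct SE {se_correct:.3f}")
```

In this design the manual standard error is too large because the structural and first-stage errors are positively correlated; with a different sign pattern it could just as well be too small. The coefficient itself is unaffected.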
4.2.2 Overidentification and the 2SLS Minimand*


Constant effects models with more instruments than endoge- 
nous variables are said to be overidentified. (Models with the 
same number of instruments and endogenous variables are 
said to be just-identified.) With more instruments than needed 
to identify the parameters of interest, overidentified models 
impose a set of restrictions that can be evaluated as part of 
a process of specification testing. This process amounts to 
asking whether the line plotted in a VIV-type picture fits the 
relevant conditional means tightly enough, given the precision 
with which these means are estimated. The details behind this 
useful idea are easiest to spell out using matrix notation. 

Let $Z_i \equiv [X_i' \ z_{1i} \ \cdots \ z_{Qi}]'$ denote the vector formed by
concatenating the exogenous covariates and $Q$ instrumental variables,
and let $W_i \equiv [X_i' \ s_i]'$ denote the vector formed by
concatenating the covariates and the single endogenous variable of
interest. In the quarter-of-birth study, for example, the covariates
are year-of-birth and state-of-birth dummies, the instruments
are quarter-of-birth dummies, and the endogenous
variable is schooling. The coefficient vector is still $\Gamma \equiv [\alpha' \ \rho]'$,
as in the previous subsection. The residuals for the causal
(second-stage) model can be defined as a function of $\Gamma$ using

$$\eta_i(\Gamma) \equiv Y_i - \Gamma' W_i = Y_i - [\alpha' X_i + \rho s_i].$$


This residual is assumed to be uncorrelated with the instrument
vector, $Z_i$. In other words, $\eta_i$ satisfies the orthogonality
condition,

$$E[Z_i \eta_i(\Gamma)] = 0. \qquad (4.2.2)$$

In any sample, however, this equation will not hold exactly
because there are more moment conditions than there are
elements of $\Gamma$.¹¹ The sample analog of (4.2.2) is the sum over $i$,

$$\frac{1}{N} \sum_i Z_i \eta_i(\Gamma) \equiv m_N(\Gamma). \qquad (4.2.3)$$

11 With a single endogenous variable and more than one instrument, $\Gamma$ is
$[K+1] \times 1$, while $Z_i$ is $[K+Q] \times 1$ for $Q > 1$. Hence the resulting linear system
cannot be solved exactly unless there is a linear dependency that makes some
of the instruments (moment equations) redundant.

2SLS can be understood as a generalized method of moments
(GMM) estimator that chooses a value for $\Gamma$ by making (4.2.3)
as close to zero as possible.

By the central limit theorem, the sample moment vector
$\sqrt{N} m_N(\Gamma)$ has an asymptotic covariance matrix equal
to $E[Z_i Z_i' \eta_i(\Gamma)^2]$, a matrix we'll call $\Lambda$. Although somewhat
intimidating at first blush, this is just a matrix of fourth
moments, as in the sandwich formula used to construct robust
standard errors, (3.1.7). As shown by Hansen (1982), the optimal
GMM estimator based on (4.2.2) minimizes a quadratic
form in the sample moment vector, $m_N(\hat{g})$, where $\hat{g}$ is a candidate
estimator of $\Gamma$. The optimal weighting matrix in the
middle of the GMM quadratic form is $\Lambda^{-1}$. In practice, of
course, $\Lambda$ is unknown and must be estimated. A feasible version
of the GMM procedure uses a consistent estimator of
$\Lambda$ in the weighting matrix. Since the estimators using known
and estimated $\Lambda$ have the same asymptotic distribution, we'll
ignore this distinction for now. The quadratic form to be
minimized can therefore be written,

$$J_N(\hat{g}) \equiv N m_N(\hat{g})' \Lambda^{-1} m_N(\hat{g}), \qquad (4.2.4)$$

where the $N$-term out front comes from $\sqrt{N}$ normalization
of the sample moments. As shown immediately below, when
the residuals are conditionally homoskedastic, the minimizer
of $J_N(\hat{g})$ is the 2SLS estimator. Without homoskedasticity, the
GMM estimator that minimizes (4.2.4) is White's (1982) two-stage
IV (a generalization of 2SLS), so it makes sense to call
$J_N(\hat{g})$ the 2SLS minimand.

Here are some of the details behind the GMM interpretation
of 2SLS.¹² Conditional homoskedasticity means that

$$\Lambda = E[Z_i Z_i' \eta_i(\Gamma)^2] = E[Z_i Z_i'] \sigma^2_\eta.$$

Substituting for $\Lambda^{-1}$ and using $y$, $Z$, and $W$ to denote sample
data vectors and matrices, the quadratic form to be minimized


12 More detail can be found in Newey (1985), Newey and West (1987), the
advanced text by Amemiya (1985), and the original Hansen (1982) GMM
paper.

becomes

$$J_N(\hat{g}) = \frac{1}{N \sigma^2_\eta} (y - W\hat{g})' Z E[Z_i Z_i']^{-1} Z' (y - W\hat{g}). \qquad (4.2.5)$$

Finally, substituting the sample cross-product matrix $[Z'Z / N]$ for
$E[Z_i Z_i']$, we have

$$J_N(\hat{g}) = \frac{1}{\sigma^2_\eta} (y - W\hat{g})' P_Z (y - W\hat{g}),$$

where $P_Z = Z(Z'Z)^{-1}Z'$. From here, we get the solution

$$\hat{g} = \hat{\Gamma}_{2SLS} = [W' P_Z W]^{-1} W' P_Z y.$$

Since the projection operator, $P_Z$, produces fitted values (i.e.,
$P_Z W$ gives the fitted values from a regression of $W$ on $Z$), and
$P_Z$ is an idempotent matrix, this can be seen to be the OLS estimator
of the second-stage equation, (4.1.9), written in matrix
notation. More generally, even without homoskedasticity we
can obtain a feasible efficient 2SLS-type estimator by minimizing
(4.2.4) and using a consistent estimator of $E[Z_i Z_i' \eta_i(\Gamma)^2]$ to
form $J_N(\hat{g})$. Typically, we'd use the empirical fourth moments,
$\sum_i Z_i Z_i' \hat{\eta}_i^2$, where $\hat{\eta}_i$ is the regular 2SLS residual computed
without worrying about heteroskedasticity (see White, 1982,
for distribution theory and other details).

The overidentification test statistic is given by the minimized 
2SLS minimand. Intuitively, this statistic tells us whether the 
sample moment vector, $m_N(\hat{g})$, is close enough to zero for
the assumption that $E[Z_i \eta_i] = 0$ to be plausible. In particular,
under the null hypothesis that the residuals and instruments
are indeed uncorrelated, the minimized $J_N(\hat{g})$ has a $\chi^2(Q - 1)$
distribution. We can therefore compare the empirical value of
the 2SLS minimand with chi-square tables in a formal test for
$H_0: E[Z_i \eta_i] = 0$.

We’re especially interested in the 2SLS minimand when the 
instruments are a full set of mutually exclusive dummy vari- 
ables, as for the Wald estimators and grouped data estimation 
strategies discussed above. In this important special case, 2SLS 
becomes WLS estimation of a grouped equation like (4.1.16), 
while the 2SLS minimand is the weighted sum of squares being
minimized. To see this, note that regression on a full set of
mutually exclusive dummy variables for an instrument that
takes on J values produces an N × 1 vector of fitted values
equal to the J conditional means at each value of the instrument
(included covariates are counted as instruments), each
one of these $n_j$ times, where $n_j$ is the group size and $\sum_j n_j = N$.
The cross-product matrix [Z’Z] in this case is a J × J diagonal
matrix with elements $n_j$. Simplifying, we then have


$$\hat{J}_N(g) = \frac{1}{\sigma_\eta^2} \sum_j n_j (\bar{y}_j - g'\bar{W}_j)^2, \tag{4.2.6}$$

where $\bar{W}_j$ is the sample mean of the rows of matrix W in
group j. Thus, $\hat{J}_N(g)$ is the GLS minimand for estimation of the
regression of $\bar{y}_j$ on $\bar{W}_j$. With a little more work (here we skip
the details), we can similarly show that the efficient two-step
IV procedure without homoskedasticity minimizes

$$\hat{J}_N(g) = \sum_j \frac{n_j}{\sigma_j^2} (\bar{y}_j - g'\bar{W}_j)^2, \tag{4.2.7}$$

where $\sigma_j^2$ is the variance of $\eta_i$ in group j. Estimation
using (4.2.7) is feasible because we can estimate $\sigma_j^2$ in a first
step, using an inefficient but still consistent 2SLS estimator that
ignores heteroskedasticity. Efficient two-step IV estimators are
constructed in Angrist (1990, 1991).
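
The equivalence between 2SLS with dummy instruments and weighted least squares on group means can be checked numerically. The sketch below is only an illustration under stated assumptions: the microdata in `rows` are hypothetical, the model has a constant and a single endogenous regressor, and the helper `wls` is our own name rather than anything from the text. Exact rational arithmetic is used so the two routes can be compared without rounding.

```python
from fractions import Fraction as F

# Hypothetical microdata (an assumption for illustration): each row is
# (instrument value j, endogenous regressor d, outcome y).
rows = [
    (0, 0, 1), (0, 0, 3), (0, 1, 4), (0, 0, 2),
    (1, 1, 5), (1, 0, 2), (1, 1, 6),
    (2, 1, 7), (2, 1, 5), (2, 0, 3), (2, 1, 6), (2, 1, 8),
]

def wls(x, y, w):
    """Return (intercept, slope) from weighted least squares of y on x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Group sizes n_j and group means of d and y.
groups = sorted({j for j, _, _ in rows})
n = {j: sum(1 for r in rows if r[0] == j) for j in groups}
dbar = {j: F(sum(r[1] for r in rows if r[0] == j), n[j]) for j in groups}
ybar = {j: F(sum(r[2] for r in rows if r[0] == j), n[j]) for j in groups}

# Route 1: 2SLS on microdata. With a full set of group dummies as
# instruments, each row's first-stage fitted value is its group mean of d.
d_hat = [dbar[j] for j, _, _ in rows]
y_micro = [F(y) for _, _, y in rows]
tsls = wls(d_hat, y_micro, [1] * len(rows))

# Route 2: WLS on the J group means, weighting by group size n_j.
grouped = wls([dbar[j] for j in groups], [ybar[j] for j in groups],
              [n[j] for j in groups])
```

Here `tsls` and `grouped` hold the same (intercept, slope) pair, which is the sense in which the grouped regression reproduces 2SLS in this special case.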

The GLS structure of the 2SLS minimand allows us to interpret
the overidentification test statistic for dummy instruments
as a measure of the goodness of fit of the line connecting $\bar{y}_j$
and $\bar{W}_j$. In other words, this is the chi-square goodness-of-fit
statistic for the regression line in a VIV plot like that shown
in figure 4.1.2. The chi-square degrees of freedom parameter
is given by the difference between the number of instruments
(groups) and the number of parameters being estimated.¹³


13 If, for example, the instrument takes on three values, one of which is
assigned to the constant, and the model includes a constant and a single
endogenous variable only, the test statistic has 1 degree of freedom.
As with the 2SLS estimator, there are many roads to the
test statistic, (4.2.7). Here are two further paths that are
worth knowing. First, the test statistic based on the GMM
minimand for IV, whether the instruments are group dummies
or not, is the same as the overidentification test statistic
discussed in widely used econometric references on simultaneous
equations models. For example, this statistic features in
Hausman’s (1983) chapter on simultaneous equations in the
Handbook of Econometrics. Hausman also proposes a simple
computational procedure: in homoskedastic models, the minimized
2SLS minimand is the sample size times the $R^2$ from a
regression of the 2SLS residuals on the instruments (and the
exogenous covariates). The formula for this is
$N\left[\frac{\hat{\eta}'P_Z\hat{\eta}}{\hat{\eta}'\hat{\eta}}\right]$, where
$\hat{\eta} = y - W\hat{\Gamma}_{2SLS}$ is the vector of 2SLS residuals.
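
The sample-size-times-R-squared shortcut can be illustrated for the dummy-instrument case, where projecting on the instruments just means taking group means. This is a sketch under assumptions, not a general implementation: the data in `rows` are hypothetical, the model has a constant and one endogenous regressor, homoskedasticity is assumed, and the residual variance is estimated by the mean squared 2SLS residual.

```python
from fractions import Fraction as F

# Hypothetical microdata (illustration only): (instrument value j, d, y).
rows = [
    (0, 0, 1), (0, 0, 3), (0, 1, 4), (0, 0, 2),
    (1, 1, 5), (1, 0, 2), (1, 1, 6),
    (2, 1, 7), (2, 1, 5), (2, 0, 3), (2, 1, 6), (2, 1, 8),
]
N = len(rows)
groups = sorted({j for j, _, _ in rows})
n = {j: sum(1 for r in rows if r[0] == j) for j in groups}

def group_mean(values):
    """Map each group j to the mean of values over the rows in that group."""
    out = {}
    for j in groups:
        members = [v for (g, _, _), v in zip(rows, values) if g == j]
        out[j] = sum(members) / len(members)
    return out

d = [F(r[1]) for r in rows]
y = [F(r[2]) for r in rows]
dbar, ybar = group_mean(d), group_mean(y)

# 2SLS with group dummies as instruments: OLS of y on group-mean fitted values.
d_hat = [dbar[j] for j, _, _ in rows]
m_d, m_y = sum(d_hat) / N, sum(y) / N
slope = (sum((a - m_d) * (b - m_y) for a, b in zip(d_hat, y))
         / sum((a - m_d) ** 2 for a in d_hat))
intercept = m_y - slope * m_d

# 2SLS residuals use the actual regressor d, not the fitted values.
resid = [yi - intercept - slope * di for yi, di in zip(y, d)]

# Regressing residuals on a full set of dummies gives group-mean fitted
# values, so R^2 is the explained over the total sum of squares.
ebar = group_mean(resid)
r2 = sum(n[j] * ebar[j] ** 2 for j in groups) / sum(e ** 2 for e in resid)
overid_nr2 = N * r2

# The grouped minimand (4.2.6) evaluated at the 2SLS estimates.
sigma2 = sum(e ** 2 for e in resid) / N
overid_grouped = sum(
    n[j] * (ybar[j] - intercept - slope * dbar[j]) ** 2 for j in groups
) / sigma2

dof = len(groups) - 2   # instruments (groups) minus parameters estimated
```

The two routes give the same number, which would then be compared with a chi-square distribution with `dof` degrees of freedom.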

Second, it’s worth emphasizing that the overidentification 
idea can be said to be “more than one way to skin the same 
econometric cat.” In other words, given more than one instru- 
ment for the same causal relation, we might construct just- 
identified IV estimators one at a time and compare them. 
This comparison checks overidentification directly: if each 
just-identified estimator is consistent, the differences between 
them should be small relative to sampling variance, and should 
shrink as the sample size and hence the precision of these esti- 
mates increases. In fact, a formal test of the equality of all 
possible just-identified estimators is said to generate a Wald 
test of this null hypothesis, while the test statistic based on the 
2SLS minimand is said to be a Lagrange multiplier (LM) test 
because it can be related to the score vector in a maximum 
likelihood version of the IV setup.¹⁴

In the grouped data version of IV, the Wald test amounts to a 
test for equality of the set of all possible linearly independent 
Wald estimators. If, for example, draft lottery numbers are 
divided into four groups based on various cohorts’ eligibility 
cutoffs (RSN 1-95, 96-125, 126-195, and the rest), then 


14 The Wald estimator and Wald test are named after the same man, Abraham
Wald, but the latter reference is Wald (1943). Wald, who died tragically
in a plane crash at the age of 48, was a giant of econometrics as well as 
mathematical statistics. 
three linearly independent Wald estimators can be constructed.
Alternatively, the efficient grouped data estimator can be con- 
structed by running GLS on these four conditional means. Four 
groups means there are three possible Wald estimators and 
two nonredundant equality restrictions on these three; hence, 
the relevant Wald statistic has 2 degrees of freedom. On the 
other hand, four groups means three instruments and a con- 
stant are available to estimate a model with two parameters 
(the constant and the causal effect of military service). So the 
2SLS minimand generates an overidentification test statistic 
with 4 — 2 = 2 degrees of freedom. And, provided you use the 
same method of estimating the weighting matrix in the rel- 
evant quadratic forms, these two test statistics not only test 
the same thing, they are numerically equivalent. This makes 
sense, since 2SLS is the efficient linear combination of Wald 
estimators.¹⁵

Finally, a caveat regarding overidentification tests in practice.
Because $J_N(\hat{g})$ measures variance-normalized goodness of
fit, the overidentification test statistic tends to be low when the 
underlying estimates are imprecise. Since IV estimates are very 
often imprecise, we cannot take much satisfaction from the 
fact that one estimate is within sampling variance of another, 
even if the individual estimates appear precise enough to be 
informative. On the other hand, in cases where the underlying 
IV estimates are quite precise, the fact that the overidentifi- 
cation test statistic rejects need not point to an identification 
failure. Rather, this may be evidence of treatment effect het- 
erogeneity, a possibility we discuss further below. On the 
conceptual side, however, an understanding of the anatomy of 
the 2SLS minimand is invaluable, for it once again highlights 


15 The fact that Wald and LM testing procedures for the same null are equivalent
in linear models was established by Newey and West (1987). Angrist
(1991) gives a more formal statement of the argument in this paragraph. An 
interesting econometric question in this context, first raised by Deaton (1985), 
is how many groups are optimal when this can be varied. The analogy between 
grouping and IV means that more groups equals more instruments, and hence 
greater asymptotic efficiency at the cost of more bias (see chapter 8). Devereux 
(2007) proposes a simple bias-corrected IV estimator for grouped data with 
many groups. 
the important link between grouped data and IV. This link
takes the mystery out of estimation and testing with instrumen- 
tal variables and focuses our attention on the raw moments 
that provide the foundation for causal inference. 


4.3 Two-Sample IV and Split-Sample IV*


The GMM interpretation of 2SLS highlights the fact that IV 
estimates can be constructed from sample moments alone, 
with no microdata. Returning to the sample moment condi- 
tion, (4.2.3), and rearranging slightly produces a regression- 
like equation involving second moments: 


$$\frac{Z'y}{N} = \frac{Z'W}{N}\Gamma + \frac{Z'\eta}{N}. \tag{4.3.1}$$

GLS estimates of $\Gamma$ in (4.3.1) are consistent because
$E\left[\frac{Z'\eta}{N}\right] = E[Z_i\eta_i] = 0$.

The 2SLS minimand can be thought of as GLS applied
to (4.3.1), after multiplying by $\sqrt{N}$ to keep the residual from
disappearing as the sample size gets large. In other words,
2SLS minimizes a quadratic form in the residuals from (4.3.1) 
with a (possibly nondiagonal) weighting matrix. An impor- 
tant insight that comes from writing the 2SLS problem in this 
way is that we do not need individual observations to estimate
(4.3.1). Just as with the OLS coefficient vector, which
can be constructed from the sample conditional mean function,
IV estimates can also be constructed from sample moments.
The necessary moments are $\frac{Z'y}{N}$ and $\frac{Z'W}{N}$. The dependent
variable, $\frac{Z'y}{N}$, is a vector of dimension [K + Q] × 1. The regressor
matrix, $\frac{Z'W}{N}$, is of dimension [K + Q] × [K + 1]. The IV
second-moment equation cannot be solved exactly unless Q = 1, so it
makes sense to make the fit as close as possible by minimizing
a quadratic form in the residuals. The most efficient weighting
matrix for this purpose is the asymptotic covariance matrix of
$\frac{Z'\eta}{\sqrt{N}}$. This again produces the 2SLS minimand, $\hat{J}_N(g)$.
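
To make the point about moments concrete, here is a small sketch for the just-identified case (a constant plus one instrument and one endogenous regressor), where the moment equation can be solved exactly and no weighting matrix is needed. The data in `rows` and the helper names `solve_2x2` and `mean` are hypothetical and serve only as an illustration; once the two moment arrays are formed, the microdata are never used again by the estimator.

```python
from fractions import Fraction as F

# Hypothetical microdata (illustration only): (instrument z, regressor d, outcome y).
rows = [(0, 0, 2), (0, 1, 5), (0, 0, 1), (1, 1, 6), (1, 1, 4), (1, 0, 3)]
N = len(rows)

# Sample second moments with Z_i = (1, z_i) and W_i = (1, d_i).
zy = [F(sum(y for _, _, y in rows), N),            # Z'y/N, constant row
      F(sum(z * y for z, _, y in rows), N)]        # Z'y/N, instrument row
zw = [[F(N, N), F(sum(d for _, d, _ in rows), N)],                 # Z'W/N
      [F(sum(z for z, _, _ in rows), N), F(sum(z * d for z, d, _ in rows), N)]]

def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

# IV estimates (intercept, slope) built from the moments alone.
gamma = solve_2x2(zw, zy)

def mean(values):
    return F(sum(values), len(values))

# Cross-check: the slope equals the Wald ratio computed from the microdata.
y1 = [y for z, _, y in rows if z == 1]
y0 = [y for z, _, y in rows if z == 0]
d1 = [d for z, d, _ in rows if z == 1]
d0 = [d for z, d, _ in rows if z == 0]
wald = (mean(y1) - mean(y0)) / (mean(d1) - mean(d0))
```

With more instruments than endogenous regressors, the system is overdetermined and the exact solve would be replaced by the weighted quadratic-form minimization described above.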

A related insight is the fact that the moment matrices on the 
left- and right-hand side of the equals sign in equation (4.3.1) 
need not come from the same data sets, provided these data
sets are drawn from the same population. This observation
leads to the two-sample instrumental variables (TSIV) estimator
used by Angrist (1990) and developed formally in
Angrist and Krueger (1992).¹⁶ Briefly, let $Z_1$ and $y_1$ denote the
instrument/covariate matrix and dependent variable vectors in
data set 1 of size $N_1$ and let $Z_2$ and $W_2$ denote the instrument/
covariate matrix and endogenous variable/covariate matrix in
data set 2 of size $N_2$. Assuming
$\text{plim}\left(\frac{Z_2'W_2}{N_2}\right) = \text{plim}\left(\frac{Z_1'W_1}{N_1}\right)$, GLS
estimates of the two-sample moment equation

$$\frac{Z_1'y_1}{N_1} = \frac{Z_2'W_2}{N_2}\Gamma + \left\{\left[\frac{Z_1'W_1}{N_1} - \frac{Z_2'W_2}{N_2}\right]\Gamma + \frac{Z_1'\eta_1}{N_1}\right\}$$

are consistent for $\Gamma$. The asymptotic distribution of this estimator
is obtained by normalizing by $\sqrt{N_1}$ and assuming $\text{plim}\left(\frac{N_2}{N_1}\right)$
is a constant.

The utility of TSIV comes from the fact that it widens the 
scope for IV estimation to situations where observations on 
dependent variables, instruments, and the endogenous vari- 
able of interest are hard to find in a single sample. It may 
be easier to find one data set that has information on out- 
comes and instruments, with which the reduced form can 
be estimated, and another data set that has information on 
endogenous variables and instruments, with which the first 
stage can be estimated. For example, in Angrist (1990), admin- 
istrative records from the Social Security Administration (SSA) 
provide information on the dependent variable (annual earn- 
ings) and the instruments (draft lottery numbers coded from 
dates of birth, as well as covariates for race and year of birth). 
The SSA does not, however, track participants’ veteran status. 
This information was taken from military records, which also 
contain dates of birth that can be used to code lottery numbers.


16 Applications of TSIV include Bjorklund and Jantti (1997), Jappelli, 
Pischke, and Souleles (1998), Currie and Yelowitz (2000), and Dee and Evans 
(2003). In a recent paper, Inoue and Solon (2009) compare the asymptotic dis- 
tributions of alternative TSIV estimators and introduce a maximum likelihood 
(LIML-type) version of TSIV. They also correct a mistake in the distribution 
theory in Angrist and Krueger (1995), discussed later in this section. 
Angrist (1990) used these military records to construct $\frac{Z_2'W_2}{N_2}$,
the first-stage correlation between lottery numbers and veteran
status conditional on race and year of birth, while the
SSA data were used to construct $\frac{Z_1'y_1}{N_1}$.


Two further simplifications make TSIV especially easy to 
use. First, as noted previously, when the instruments consist of 
a full set of mutually exclusive dummy variables, as in Angrist 
(1990) and Angrist and Krueger (1992), the second moment 
equation, (4.3.1), simplifies to a model for conditional means. 
In particular, the 2SLS minimand for the two-sample problem 
becomes 


$$\hat{J}_N(g) = \sum_j \omega_j (\bar{y}_{1j} - g'\bar{W}_{2j})^2, \tag{4.3.2}$$

where $\bar{y}_{1j}$ is the mean of the dependent variable at instrument/
covariate value j in one sample, $\bar{W}_{2j}$ is the mean of endogenous
variables and covariates at instrument/covariate value
j in a second sample, and $\omega_j$ is an appropriate weight. This
amounts to WLS estimation of the VIV equation, except that
the dependent and independent variables do not come from the
same sample. Again, Angrist (1990) and Angrist and Krueger
(1992) provide illustrations. The optimal weights for asymptotically
efficient TSIV are given by the inverse of the variance of
$\bar{y}_{1j} - g'\bar{W}_{2j}$.
This variance is easy to compute if the two samples used for
TSIV are independent.

Second, Angrist and Krueger (1995) introduced a computa- 
tionally attractive TSIV-type estimator that requires no matrix 
manipulation and can be implemented with ordinary regres- 
sion software. This estimator, called split-sample IV (SSIV), 
works as follows.¹⁷ The first-stage estimates in data set 2 are
$(Z_2'Z_2)^{-1}Z_2'W_2$. These are carried over to data set 1 by constructing
cross-sample fitted values, $\hat{W}_{12} \equiv Z_1(Z_2'Z_2)^{-1}Z_2'W_2$.


17 Angrist and Krueger called this estimator SSIV because they were con- 
cerned with a scenario where a single data set is deliberately split in two. 
As discussed in section 4.6.4, the resulting estimator may have less bias than 
conventional 2SLS. Inoue and Solon (2009) refer to the estimator Angrist and 
Krueger (1995) called SSIV as two-sample 2SLS, or TS2SLS. 
The SSIV second stage is a regression of $y_1$ on $\hat{W}_{12}$. The correct
asymptotic distribution for this estimator is derived in
Inoue and Solon (2009), who show that the distribution presented
in Angrist and Krueger (1995) requires the assumption
that $Z_1'Z_1 = Z_2'Z_2$ (as would be true if the marginal distribution
of the instruments and covariates is fixed in repeated
samples). It’s worth noting, however, that the limiting distributions
of SSIV and 2SLS are the same when the coefficient
on the endogenous variable is zero. The standard errors for
this special case are simple to construct and probably provide
a reasonably good approximation to the general case.¹⁸
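
The mechanics of SSIV can be sketched with two small samples, one holding instruments and outcomes and the other holding instruments and the endogenous variable. Everything here is a hypothetical illustration: the samples, the single binary instrument, and the helper `ols` are assumptions, and the sketch computes point estimates only, with no attempt at the standard errors discussed above.

```python
from fractions import Fraction as F

# Hypothetical samples (illustration only). Data set 1 records (z, y);
# data set 2 records (z, d). Neither contains all three variables.
sample1 = [(0, 2), (0, 1), (0, 3), (1, 4), (1, 6), (1, 5)]
sample2 = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 0), (1, 1)]

def ols(x, y):
    """Return (intercept, slope) from a bivariate OLS regression of y on x."""
    n = len(x)
    xbar, ybar = F(sum(x), n), F(sum(y), n)
    slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
             / sum((xi - xbar) ** 2 for xi in x))
    return ybar - slope * xbar, slope

z1, y1 = zip(*sample1)
z2, d2 = zip(*sample2)

# First stage, estimated in data set 2 only.
pi0, pi1 = ols(z2, d2)

# Cross-sample fitted values: data set 2's first stage applied to
# data set 1's instruments.
w12 = [pi0 + pi1 * z for z in z1]

# SSIV second stage: regress y_1 on the cross-sample fitted values.
_, ssiv = ols(w12, y1)

# With a single instrument, this is the reduced form from sample 1
# divided by the first stage from sample 2.
_, reduced_form = ols(z1, y1)
```

In this just-identified setup the second-stage slope `ssiv` coincides with `reduced_form / pi1`, which is why only ordinary regression software is needed.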


4.4 IV with Heterogeneous Potential Outcomes 


The discussion of IV up to this point postulates a constant 
causal effect. In the case of a dummy variable such as veteran 
status, this means $Y_{1i} - Y_{0i} = \rho$ for all i, while with a multivalued
treatment such as schooling, this means $Y_{si} - Y_{s-1,i} = \rho$
for all s and all i. Both are highly stylized views of the world,
especially the multivalued case, which imposes linearity as 
well as homogeneity. To focus on one thing at a time in a 
heterogeneous effects model, we start with a zero-one causal 
variable, like a treatment dummy. In this context, we’d like 
to allow for treatment effect heterogeneity, in other words, a 
distribution of causal effects across individuals. 


18 This shortcut formula uses the standard errors from the manual SSIV
second stage. The correct asymptotic covariance matrix formula, from Inoue
and Solon (2005), is

$$\left\{B'\left[(\sigma_{11} + \kappa\Gamma'\Sigma_{22}\Gamma)A\right]^{-1}B\right\}^{-1},$$

where $B = \text{plim}\left(\frac{Z_2'W_2}{N_2}\right) = \text{plim}\left(\frac{Z_1'W_1}{N_1}\right)$, $A = \text{plim}\left(\frac{Z_1'Z_1}{N_1}\right) = \text{plim}\left(\frac{Z_2'Z_2}{N_2}\right)$,
$\text{plim}\left(\frac{N_1}{N_2}\right) = \kappa$, $\sigma_{11}$ is the variance of the reduced-form residual in data set 1,
and $\Sigma_{22}$ is the variance of the first-stage residual in data set 2. In principle,
these pieces are easy enough to calculate. Other approaches to SSIV inference
include those of Dee and Evans (2003), who calculate standard errors for
just-identified models using the delta method, and Bjorklund and Jantti
(1997), who use a bootstrap.
Why is treatment effect heterogeneity important? The
answer lies in the distinction between the two types of vali- 
dity that characterize a research design. Internal validity is 
the question of whether a given design successfully uncovers 
causal effects for the population being studied. A randomized 
clinical trial, or for that matter a good IV study, has a strong 
claim to internal validity. External validity is the predictive 
value of the study’s findings in a different context. For exam- 
ple, if the study population in a randomized trial is especially 
likely to benefit from treatment, the resulting estimates may 
have little external validity. Likewise, draft lottery estimates 
of the effects of conscription for service in the Vietnam era 
need not be a good measure of the consequences of voluntary 
military service. An econometric framework with heteroge- 
neous treatment effects helps us to assess both the internal 
and external validity of IV estimates.¹⁹


4.4.1 Local Average Treatment Effects 


In an IV framework, the engine that drives causal inference is
the instrument $Z_i$, but the variable of interest is still $D_i$. This
feature of the IV setup leads us to adopt a generalized potential
outcomes concept, indexed against both instruments and
treatment status. Let $Y_i(d, z)$ denote the potential outcome of
individual i were this person to have treatment status $D_i = d$
and instrument value $Z_i = z$. This tells us, for example, what
the earnings of i would be given alternative combinations of
veteran status and draft eligibility status. The causal effect
of veteran status given i’s realized draft eligibility status is
$Y_i(1, Z_i) - Y_i(0, Z_i)$, while the causal effect of draft eligibility
status given i’s veteran status is $Y_i(D_i, 1) - Y_i(D_i, 0)$.

We can think of instrumental variables as initiating a causal
chain where the instrument $Z_i$ affects the variable of interest,
$D_i$, which in turn affects outcomes, $Y_i$. To make this precise,
we introduce notation to express the idea that the instrument 


19 The distinction between internal and external validity has a long history
in social science. See, for example, the chapter-length discussion in Shadish, 
Cook, and Campbell (2002), the successor to a classic text on research 
methods by Campbell and Stanley (1963). 
has a causal effect on $D_i$. Let $D_{1i}$ be i’s treatment status when
$Z_i = 1$, while $D_{0i}$ is i’s treatment status when $Z_i = 0$. Observed
treatment status is therefore

$$D_i = D_{0i} + (D_{1i} - D_{0i})Z_i = \pi_0 + \pi_{1i}Z_i + \xi_i. \tag{4.4.1}$$

In random coefficients notation, $\pi_0 \equiv E[D_{0i}]$ and $\pi_{1i} \equiv (D_{1i} - D_{0i})$,
so $\pi_{1i}$ is the heterogeneous causal effect of the instrument
on $D_i$. As with potential outcomes, only one of the potential
treatment assignments, $D_{1i}$ and $D_{0i}$, is ever observed for any
one person. In the draft lottery example, $D_{0i}$ tells us whether i
would serve in the military if he drew a high (draft-ineligible)
lottery number, while $D_{1i}$ tells us whether i would serve if he
drew a low (draft-eligible) lottery number. We get to see one
or the other of these potential assignments depending on $Z_i$.
The average causal effect of $Z_i$ on $D_i$ is $E[\pi_{1i}]$.

The first assumption in the heterogeneous effects framework 
is that the instrument is as good as randomly assigned: it is 
independent of the vector of potential outcomes and potential 
treatment assignments. Formally, this can be written as 


$$[\{Y_i(d, z);\ \forall\, d, z\}, D_{1i}, D_{0i}] \perp\!\!\!\perp Z_i. \tag{4.4.2}$$

This independence assumption is sufficient for a causal interpretation
of the reduced form, that is, the regression of $Y_i$
on $Z_i$. Specifically,

$$\begin{aligned}
E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0] &= E[Y_i(D_{1i}, 1)|Z_i = 1] - E[Y_i(D_{0i}, 0)|Z_i = 0] \\
&= E[Y_i(D_{1i}, 1) - Y_i(D_{0i}, 0)],
\end{aligned}$$

the causal effect of the instrument on $Y_i$. Independence also
means that

$$\begin{aligned}
E[D_i|Z_i = 1] - E[D_i|Z_i = 0] &= E[D_{1i}|Z_i = 1] - E[D_{0i}|Z_i = 0] \\
&= E[D_{1i} - D_{0i}].
\end{aligned}$$

In other words, the first stage from our earlier discussion of
2SLS captures the causal effect of $Z_i$ on $D_i$.

The second key assumption in the heterogeneous effects
framework is the presumption that $Y_i(d, z)$ is a function only
of d. To be specific, while draft eligibility clearly affects veteran
status, an individual’s potential earnings as a veteran or
nonveteran are assumed to be unchanged by draft eligibility
status. In general, the claim that an instrument operates
through a single known causal channel is called an exclusion
restriction. Formally, the exclusion restriction says that

$$Y_i(d, 0) = Y_i(d, 1) \text{ for } d = 0, 1.$$


In a linear model with constant effects, the exclusion restriction
is expressed by the omission of instruments from the
causal equation of interest and by saying that $E[Z_i\eta_i] = 0$ in
an equation like (4.1.14) in section 4.1. It’s worth noting,
however, that the traditional error term notation used for
simultaneous equations models doesn’t lend itself to a clear
distinction between independence and exclusion. We need $Z_i$
and $\eta_i$ to be uncorrelated, but the reasoning that lies behind
this assumption is unclear until we consider the independence
and exclusion restrictions as distinct propositions.

The exclusion restriction fails for draft lottery instruments 
if men with low draft lottery numbers were affected in some 
way other than through an increased likelihood of military 
service. For example, Angrist and Krueger (1992) looked for 
an association between draft lottery numbers and schooling. 
Their idea was that educational draft deferments could have 
led men with low lottery numbers to stay in college longer than 
they would have otherwise desired. If so, draft lottery num- 
bers are correlated with earnings for at least two reasons: an 
increased likelihood of military service and an increased like- 
lihood of college attendance. The fact that the lottery number 
is randomly assigned (and therefore satisfies the independence 
assumption) does not make this possibility less likely. The 
exclusion restriction is distinct from the claim that the instru- 
ment is (as good as) randomly assigned. Rather, it is a claim 
about a unique channel for causal effects of the instrument.²⁰


20 As it turns out, there is not much of a relationship between schooling and 
lottery numbers in the Angrist and Krueger (1992) data, probably because 
educational deferments were phased out during the lottery period. On the 
other hand, in a recent paper, Angrist and Chen (2007) argue that Vietnam 
veterans end up with more schooling because of veterans benefits (known 
as the GI Bill). Extra schooling via the GI Bill does not violate the exclu- 
sion restriction because veterans’ benefits are a downstream consequence of 
military service. 
Using the exclusion restriction, we can define potential outcomes
indexed solely against treatment status using the single
index $(Y_{1i}, Y_{0i})$ notation we have been using all along. In
particular,

$$\begin{aligned}
Y_{1i} &\equiv Y_i(1, 1) = Y_i(1, 0); \\
Y_{0i} &\equiv Y_i(0, 1) = Y_i(0, 0).
\end{aligned} \tag{4.4.3}$$

The observed outcome, $Y_i$, can therefore be written in terms
of potential outcomes as:

$$\begin{aligned}
Y_i &= Y_i(0, Z_i) + [Y_i(1, Z_i) - Y_i(0, Z_i)]D_i \\
&= Y_{0i} + (Y_{1i} - Y_{0i})D_i.
\end{aligned} \tag{4.4.4}$$

Random coefficients notation for this is

$$Y_i = \alpha_0 + \rho_i D_i + \eta_i,$$

a compact version of (4.4.4) with $\alpha_0 \equiv E[Y_{0i}]$ and
$\rho_i \equiv Y_{1i} - Y_{0i}$.

A final assumption needed for heterogeneous IV models is
that either $\pi_{1i} \geq 0$ for all i or $\pi_{1i} \leq 0$ for all i. This monotonicity
assumption, introduced by Imbens and Angrist (1994),
means that while the instrument may have no effect on some
people, all those who are affected are affected in the same way.
In other words, either $D_{1i} \geq D_{0i}$ or $D_{1i} \leq D_{0i}$ for all i. In what
follows, we assume monotonicity holds with $D_{1i} \geq D_{0i}$. In the
draft lottery example, this means that although draft eligibility
may have had no effect on the probability of military service
for some men, there is no one who was actually kept out of
the military by being draft eligible. Without monotonicity, IV
estimators are not guaranteed to estimate a weighted average
of the underlying individual causal effects, $Y_{1i} - Y_{0i}$.

Given the exclusion restriction, the independence of instru- 
ments and potential outcomes, the existence of a first stage, 
and monotonicity, the Wald estimand can be interpreted as 
the effect of veteran status on those whose treatment status 
can be changed by the instrument. This parameter is called 
the local average treatment effect (LATE; Imbens and Angrist, 
1994). Here is a formal statement: 
Theorem 4.4.1 The LATE Theorem. Suppose
(A1, Independence) $\{Y_i(D_{1i}, 1), Y_i(D_{0i}, 0), D_{1i}, D_{0i}\} \perp\!\!\!\perp Z_i$;
(A2, Exclusion) $Y_i(d, 0) = Y_i(d, 1) \equiv Y_{di}$ for d = 0, 1;
(A3, First stage) $E[D_{1i} - D_{0i}] \neq 0$;
(A4, Monotonicity) $D_{1i} - D_{0i} \geq 0\ \forall i$, or vice versa.
Then

$$\frac{E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0]}{E[D_i|Z_i = 1] - E[D_i|Z_i = 0]} = E[Y_{1i} - Y_{0i}|D_{1i} > D_{0i}] = E[\rho_i|\pi_{1i} > 0].$$

Proof. Use the exclusion restriction to write $E[Y_i|Z_i = 1] = E[Y_{0i} + (Y_{1i} - Y_{0i})D_i|Z_i = 1]$,
which equals $E[Y_{0i} + (Y_{1i} - Y_{0i})D_{1i}]$ by independence.²¹ Likewise
$E[Y_i|Z_i = 0] = E[Y_{0i} + (Y_{1i} - Y_{0i})D_{0i}]$, so the numerator of the Wald estimator is
$E[(Y_{1i} - Y_{0i})(D_{1i} - D_{0i})]$. Monotonicity means $D_{1i} - D_{0i}$ equals
one or zero, so

$$E[(Y_{1i} - Y_{0i})(D_{1i} - D_{0i})] = E[Y_{1i} - Y_{0i}|D_{1i} > D_{0i}]P[D_{1i} > D_{0i}].$$

A similar argument shows

$$E[D_i|Z_i = 1] - E[D_i|Z_i = 0] = E[D_{1i} - D_{0i}] = P[D_{1i} > D_{0i}].$$


This theorem says that an instrument that is as good as ran- 
domly assigned, affects the outcome through a single known 
channel, has a first stage, and affects the causal channel of 
interest only in one direction can be used to estimate the aver- 
age causal effect on the affected group. Thus, IV estimates 
of effects of military service using the draft lottery capture the 
effect of military service on men who served because they were 
draft eligible but who would not otherwise have served. This 
excludes volunteers and men who were exempted from mili- 
tary service for medical reasons, but it includes men for whom 
draft policy was binding. 
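
The theorem can be checked on a small finite population. The sketch below is a hypothetical illustration only: the three types, their shares, and their potential outcomes are invented numbers, and independence is imposed by letting type shares be the same at both values of the instrument. It computes the Wald ratio from conditional means and compares it with the complier effect and with the population average effect.

```python
from fractions import Fraction as F

# Hypothetical population (illustration only): each type has a share,
# potential treatments (d0, d1), and potential outcomes (y0, y1).
types = {
    "complier":     dict(share=F(1, 2), d0=0, d1=1, y0=10, y1=14),
    "always-taker": dict(share=F(1, 5), d0=1, d1=1, y0=12, y1=13),
    "never-taker":  dict(share=F(3, 10), d0=0, d1=0, y0=8, y1=18),
}

def mean_given_z(z, what):
    """E[Y | Z=z] if what == 'y', else E[D | Z=z].

    Independence: type shares do not vary with z.
    Exclusion: the outcome depends on z only through treatment d.
    """
    total = F(0)
    for t in types.values():
        d = t["d1"] if z == 1 else t["d0"]
        y = t["y1"] if d == 1 else t["y0"]
        total += t["share"] * (y if what == "y" else d)
    return total

reduced_form = mean_given_z(1, "y") - mean_given_z(0, "y")
first_stage = mean_given_z(1, "d") - mean_given_z(0, "d")
wald = reduced_form / first_stage

# Average effect on compliers, and the average effect in the whole population.
late = F(types["complier"]["y1"] - types["complier"]["y0"])
ate = sum(t["share"] * (t["y1"] - t["y0"]) for t in types.values())
```

In this made-up population the Wald ratio matches the complier effect and differs from the population average, because never-takers and always-takers contribute nothing to either the reduced form or the first stage.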

How useful is LATE? No theorem answers this question, but 
it’s always worth discussing. Part of the interest in the effects of 


21 Note that the statement of independence in A1 has been simplified from
(4.4.2) to cover only those values of $Y_i(d, z)$ that we might see, specifically,
$Y_i(D_{zi}, z)$ for z = 0, 1.
Vietnam-era service has to do with the question of whether veterans
(especially conscripts) were adequately compensated for
their service. Internally valid draft lottery estimates answer this 
question. Draft lottery estimates of the effects of Vietnam-era 
conscription may also be relevant for discussions of any future 
conscription policy. On the other hand, while draft lottery 
instruments produce internally valid estimates of the causal 
effect of Vietnam-era conscription, the external validity—that 
is, the predictive value of these estimates for military service 
in other times and places—is not directly addressed by the 
IV framework. There is nothing in IV formulas to explain 
why Vietnam-era service affects earnings; for that, you need 
a theory.²²

You might wonder why we need monotonicity for the LATE 
theorem, an assumption that plays no role in the traditional 
simultaneous equations framework with constant effects. A 
failure of monotonicity means the instrument pushes some 
people into treatment while pushing others out. Angrist, 
Imbens, and Rubin (1996) call the latter group defiers. Defiers 
complicate the link between LATE and the reduced form. To 
see why, go back to the step in the proof of the LATE theorem 
that shows the reduced form is 


$$E[Y_i|Z_i = 1] - E[Y_i|Z_i = 0] = E[(Y_{1i} - Y_{0i})(D_{1i} - D_{0i})].$$

Without monotonicity, this is equal to

$$E[Y_{1i} - Y_{0i}|D_{1i} > D_{0i}]P[D_{1i} > D_{0i}] - E[Y_{1i} - Y_{0i}|D_{1i} < D_{0i}]P[D_{1i} < D_{0i}].$$


We might therefore have a scenario where treatment effects 
are positive for everyone yet the reduced form is zero because 
effects on compliers are canceled out by effects on defiers. 
This doesn’t come up in a constant effects model because 
the reduced form is always the constant effect times the first 


22 Angrist (1990) interprets draft lottery estimates as the penalty for lost 
labor market experience. This suggests draft lottery estimates should have 
external validity for the effects of conscription in other periods, a conjecture 
borne out by the results for World War II draftees in Angrist and Krueger
(1994). 




stage, regardless of whether the first stage includes defiant
behavior.²³

A deeper understanding of LATE can be had by linking it to 
a workhorse of contemporary econometrics, the latent index 
model for dummy endogenous variables such as assignment 
to treatment. Latent index models describe individual choices 
as being determined by a comparison of partly observed and 
partly unknown (“latent”) utilities and costs (see, e.g., Heck- 
man, 1978). Typically, these unobservables are thought of as 
being related to outcomes, in which case the treatment variable 
is said to be endogenous (though it is not really endogenous in 
a simultaneous equations sense). For example, we can model 
veteran status as 


$$D_i = \begin{cases} 1 & \text{if } \gamma_0 + \gamma_1 Z_i > v_i \\ 0 & \text{otherwise,} \end{cases}$$

where $v_i$ is a random factor involving unobserved costs and
benefits of military service assumed to be independent of $Z_i$.
This latent index model characterizes potential treatment
assignments as

$$D_{0i} = 1[\gamma_0 > v_i] \text{ and } D_{1i} = 1[\gamma_0 + \gamma_1 > v_i].$$

Note that in this model, monotonicity is automatically satisfied
since $\gamma_1$ is a constant. Assuming $\gamma_1 > 0$, LATE can be written

$$E[Y_{1i} - Y_{0i}|D_{1i} > D_{0i}] = E[Y_{1i} - Y_{0i}|\gamma_0 + \gamma_1 > v_i > \gamma_0],$$

which is a function of the latent first-stage parameters, $\gamma_0$
and $\gamma_1$, as well as the joint distribution of $Y_{1i} - Y_{0i}$ and $v_i$.
This is not, in general, the same as the unconditional average
treatment effect, $E[Y_{1i} - Y_{0i}]$, or the effect on the treated,

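23 With a constant effect, $\rho$,

$$\begin{aligned}
&E[Y_{1i} - Y_{0i}|D_{1i} > D_{0i}]P[D_{1i} > D_{0i}] - E[Y_{1i} - Y_{0i}|D_{1i} < D_{0i}]P[D_{1i} < D_{0i}] \\
&\quad = \rho\{P[D_{1i} > D_{0i}] - P[D_{1i} < D_{0i}]\} \\
&\quad = \rho\{E[D_{1i} - D_{0i}]\}.
\end{aligned}$$

So a zero reduced-form effect means either the first stage is zero or $\rho = 0$.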
$E[Y_{1i} - Y_{0i}|D_i = 1]$. We explore the distinction between different
average causal effects in the next section.
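
A small deterministic sketch may help fix ideas about the latent index model. The parameter values, the grid of values for the unobservable, and the function `effect` linking treatment effects to the unobservable are all hypothetical assumptions. The code constructs potential treatment assignments from the index, confirms monotonicity, and compares the average effect on compliers with the average effect over the whole grid.

```python
from fractions import Fraction as F

gamma0, gamma1 = F(-3, 4), F(1)            # hypothetical first-stage parameters
v_grid = [F(k, 2) for k in range(-4, 5)]   # hypothetical v_i: -2, -1.5, ..., 2

def effect(v):
    """Hypothetical individual effect Y1 - Y0, decreasing in v."""
    return 2 - v

# Potential treatment assignments implied by the latent index.
d0 = [int(gamma0 > v) for v in v_grid]
d1 = [int(gamma0 + gamma1 > v) for v in v_grid]

# Monotonicity holds automatically because gamma1 is a positive constant.
monotone = all(b >= a for a, b in zip(d0, d1))

# Compliers are those whose v falls between the two thresholds.
compliers = [v for v, a, b in zip(v_grid, d0, d1) if b > a]
late = sum(effect(v) for v in compliers) / len(compliers)
ate = sum(effect(v) for v in v_grid) / len(v_grid)
```

Because effects vary with the unobservable here, the complier average `late` differs from the grid-wide average `ate`, and a different choice of thresholds would pick out a different complier group.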


4.4.2 The Compliant Subpopulation 


The LATE framework partitions any population with an 
instrument into a set of three instrument-dependent sub- 
groups, defined by the manner in which members of the 
population react to the instrument: 


Definition 4.4.1 Compliers. The subpopulation with $D_{1i} = 1$
and $D_{0i} = 0$.
Always-takers. The subpopulation with $D_{1i} = D_{0i} = 1$.
Never-takers. The subpopulation with $D_{1i} = D_{0i} = 0$.


LATE is the effect of treatment on the population of com- 
pliers. The term “compliers” comes from an analogy with 
randomized trials where some experimental subjects comply 
with the randomly assigned treatment protocol (e.g., take 
their medicine) but some do not, while some control subjects 
obtain access to the experimental treatment even though they 
are not supposed to. Those who don’t take their medicine 
when randomly assigned to do so are never-takers, while 
those who take the medicine even when put into the control 
group are always-takers. Without adding further assumptions 
(e.g., constant causal effects), LATE is not informative about 
effects on never-takers and always-takers because, by defini- 
tion, treatment status for these two groups is unchanged by the 
instrument. The analogy between IV and a randomized trial 
with partial compliance is more than allegorical: IV solves the 
problem of causal inference in a randomized trial with partial 
compliance. This important point merits a separate subsection, 
below. 

Before turning to this important special case, we make a few 
general points. First, the average causal effect on compliers is 
not usually the same as the average treatment effect on the 
treated. From the simple fact that D; = Do; + (D1; — Doj)Zi, we 
learn that the treated population consists of two nonoverlap- 
ping groups. By monotonicity, we cannot have both Do; = 1 
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and D4; — Do; = 1, since Do; = 1 implies D1; = 1. The treated 
therefore have either Do; = 1 or D1; — Do; = 1 and z; = 1, and 
hence D; can be written as the sum of two mutually exclu- 
sive dummies, Dj) and (Dj; —Dpo;)Z;. In other words, the 
treated consist of either always-takers or of compliers with 
the instrument switched on. Since the instrument is as good as 
randomly assigned, compliers with the instrument switched 
on are representative of all compliers. From here we get 
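
$$
\begin{aligned}
\underbrace{E[Y_{1i} - Y_{0i} \mid D_i = 1]}_{\text{Effect on the treated}}
&= \underbrace{E[Y_{1i} - Y_{0i} \mid D_{0i} = 1]}_{\text{Effect on always-takers}} P[D_{0i} = 1 \mid D_i = 1] \\
&\quad + \underbrace{E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}]}_{\text{Effect on compliers}} P[D_{1i} > D_{0i}, Z_i = 1 \mid D_i = 1].
\end{aligned}
\tag{4.4.5}
$$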


Since $P[D_{0i} = 1 \mid D_i = 1]$ and $P[D_{1i} > D_{0i}, Z_i = 1 \mid D_i = 1]$ add up
to one, this means that the effect of treatment on the treated is 
a weighted average of effects on always-takers and compliers. 

Likewise, LATE is not the average causal effect of treatment
on the nontreated, $E[Y_{1i} - Y_{0i} \mid D_i = 0]$. In the draft lottery
example, the average effect on the nontreated is the aver- 
age causal effect of military service on the population of 
nonveterans from Vietnam-era cohorts. The average effect of 
treatment on the nontreated is a weighted average of effects 
on never-takers and compliers. In particular, 


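$$
\begin{aligned}
\underbrace{E[Y_{1i} - Y_{0i} \mid D_i = 0]}_{\text{Effect on the nontreated}}
&= \underbrace{E[Y_{1i} - Y_{0i} \mid D_{1i} = 0]}_{\text{Effect on never-takers}} P[D_{1i} = 0 \mid D_i = 0] \\
&\quad + \underbrace{E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}]}_{\text{Effect on compliers}} P[D_{1i} > D_{0i}, Z_i = 0 \mid D_i = 0],
\end{aligned}
\tag{4.4.6}
$$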


where we use the fact that, by monotonicity, those with
$D_{1i} = 0$ must be never-takers.

Finally, averaging (4.4.5) and (4.4.6) using

$$
E[Y_{1i} - Y_{0i}] = E[Y_{1i} - Y_{0i} \mid D_i = 1]P[D_i = 1] + E[Y_{1i} - Y_{0i} \mid D_i = 0]P[D_i = 0]
$$


shows the unconditional average treatment effect to be a 
weighted average of effects on compliers, always-takers, and 
never-takers. Of course, this is a conclusion we could have 
reached directly given monotonicity and definition (4.4.1). 
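
The weighted averages in (4.4.5) and (4.4.6) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the type shares, the type-specific effects, and the value of P[Z = 1] are made-up numbers, and the function name `decompose` is hypothetical. It assumes monotonicity and an instrument that is independent of type.

```python
# Illustrative check of (4.4.5), (4.4.6), and the averaging identity.
# All numbers are hypothetical; they are not taken from any study.

def decompose(shares, effects, p_z):
    """Return average effects on the treated, the nontreated, and everyone.

    shares and effects are dicts keyed by "always", "complier", "never";
    p_z is P[Z = 1], with Z assumed independent of type.
    """
    # Treated: always-takers plus compliers with the instrument switched on.
    w_always = shares["always"]
    w_comp_on = shares["complier"] * p_z
    p_treated = w_always + w_comp_on
    att = (effects["always"] * w_always
           + effects["complier"] * w_comp_on) / p_treated

    # Nontreated: never-takers plus compliers with the instrument switched off.
    w_never = shares["never"]
    w_comp_off = shares["complier"] * (1 - p_z)
    p_untreated = w_never + w_comp_off
    atn = (effects["never"] * w_never
           + effects["complier"] * w_comp_off) / p_untreated

    # Unconditional average: weight the two pieces by P[D = 1] and P[D = 0].
    ate = att * p_treated + atn * p_untreated
    return att, atn, ate


shares = {"always": 0.2, "complier": 0.5, "never": 0.3}
effects = {"always": 1.0, "complier": 2.0, "never": 3.0}
att, atn, ate = decompose(shares, effects, p_z=0.4)

late = effects["complier"]  # the only piece an instrument identifies
print(f"LATE={late:.2f} ATT={att:.2f} ATN={atn:.2f} ATE={ate:.2f}")
```

With these invented inputs the effect on the treated and the effect on the nontreated both differ from LATE, because each mixes compliers with a group whose treatment status the instrument never changes.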

Because an IV is not directly informative about effects on 
always-takers and never-takers, instruments do not usually 
capture the average causal effect on all of the treated or on all 
of the nontreated. There are important exceptions to this rule, 
however: instrumental variables that allow no always-takers 
or no never-takers. Although this scenario is not typical, it is 
an important special case. One example is the twins instrument 
for fertility, used by Rosenzweig and Wolpin (1980), Bronars 
and Grogger (1994), Angrist and Evans (1998), and Angrist, 
Lavy, and Schlosser (2006). Another is Oreopoulos’s (2006) 
recent study using changes in compulsory attendance laws as 
instruments for schooling in Britain. 

To see how this special case works with twins instruments, 
let $T_i$ be a dummy variable indicating multiple second births.
Angrist and Evans (1998) used this instrument to estimate 
the causal effect of having three children on earnings in the 
population of women with at least two children. The third 
child is especially interesting because reduced fertility for 
American wives in the 1960s and 1970s meant a switch from 
three children to two. Multiple second births provide quasi-experimental
variation on this margin. Let $Y_{0i}$ denote potential
earnings if a woman has only two children while $Y_{1i}$ denotes
her potential earnings if she has three, an event indicated by
$D_i$. Assuming that $T_i$ is as good as randomly assigned, that
fertility increases by at most one child in response to a multiple
birth, and that multiple births affect outcomes only by
increasing fertility, LATE using the twins instrument $T_i$ is also
$E[Y_{1i} - Y_{0i} \mid D_i = 0]$, the average causal effect on women who
are not treated (i.e., have two children only). This is because
all women who have a multiple second birth end up with three
children, that is, there are no never-takers in response to the
twins instrument.

Oreopoulos (2006) also uses IV to estimate an average 
causal effect of treatment on the nontreated. His study esti- 
mates the economic returns to schooling using an increase in 
the British compulsory attendance age from 14 to 15. Compliance
with Britain’s new compulsory attendance law
was near perfect, though many teens would previously have 
dropped out of school at age 14. The causal effect of interest in 
this case is the earnings premium for an additional year of high 
school. Finishing this year can be thought of as the treatment. 
Since everybody in Oreopoulos’s British sample finished an 
additional year when compulsory schooling laws were made 
stricter, there are no never-takers. Oreopoulos’s IV strategy 
therefore captures the average causal effect of obtaining one 
more year of high school on all those who leave school at 14. 
This turns on the fact that British teens are remarkably law- 
abiding people—Oreopoulos’s IV strategy wouldn’t estimate 
the effect of treatment on the nontreated in, say, Israel, where 
teenagers get more leeway when it comes to compulsory school 
attendance. Israeli econometricians using changes in compul- 
sory attendance laws as instruments must therefore make do 
with LATE. 


4.4.3 IV in Randomized Trials


The language of the LATE framework is based on an anal- 
ogy between IV and randomized trials. But some instruments 
really do come from randomized trials. If the instrument is 
a randomly assigned offer of treatment, then LATE is the 
effect of treatment on those who comply with the offer but 
are not treated otherwise. An especially important case is 
when the instrument is generated by a randomized trial with 
one-sided noncompliance. In many randomized trials, par- 
ticipation is voluntary among those randomly assigned to 
receive treatment. On the other hand, no one in the control 
group has access to the experimental intervention. Since the 
group that receives (i.e., complies with) the assigned treat- 
ment is a self-selected subset of those offered treatment, a 
comparison between those actually treated and the control
group is misleading. The selection bias in this case is almost 
always positive: those who take their medicine in a random- 
ized trial tend to be healthier; those who take advantage of 
randomly assigned economic interventions such as training 
programs tend to earn more anyway. 

IV using randomly assigned treatment intended as an instru- 
mental variable for treatment received solves this sort of 
compliance problem. Moreover, LATE is the effect of treatment
on the treated in this case. Suppose the instrument $Z_i$ is a
dummy variable indicating random assignment to a treatment
group, while $D_i$ is a dummy indicating whether treatment was
actually received. In practice, because of noncompliance, $D_i$
is not equal to $Z_i$. An example is the randomized evaluation
of the JTPA training program, where only 60 percent of those 
assigned to be trained received training, while roughly 2 per- 
cent of those assigned to the control group received training 
anyway (Bloom et al., 1997; see also section 7.2.1). Non- 
compliance in the JTPA arose from lack of interest among 
participants and the failure of program operators to encour- 
age participation. Since the compliance problem in this case 
was largely confined to the treatment group, LATE using random
assignment, $Z_i$, as an instrument for treatment received,
$D_i$, is the effect of treatment on the treated.

The use of IV to solve compliance problems is illustrated in 
table 4.4.1, which presents results from the JTPA experiment. 
The outcome variable of primary interest in the JTPA exper- 
iment is total earnings in the 30-month period after random 
assignment. Columns 1 and 2 of the table show the difference 
in earnings between those who were trained and those who 
were not (the OLS estimates in column 2 are from a regression 
model that adjusts for a number of individual characteristics 
measured at the beginning of the experiment). The contrast 
reported in columns 1 and 2 is on the order of $4,000 for 
men and $2,200 for women, in both cases a large treatment 
difference that amounts to about 20 percent of average earn- 
ings. But these estimates are misleading because they compare 
individuals according to $D_i$, the actual treatment received.
Since individuals assigned to the treatment group were free to
decline (and 40 percent did so), this comparison throws away
the random assignment unless the decision to comply was itself
independent of potential outcomes. This seems unlikely.

TABLE 4.4.1
Results from the JTPA experiment: OLS and IV estimates of training impacts

| | Comparisons by training status (OLS), without covariates (1) | Comparisons by training status (OLS), with covariates (2) | Comparisons by assignment status (ITT), without covariates (3) | Comparisons by assignment status (ITT), with covariates (4) | Instrumental variable estimates (IV), without covariates (5) | Instrumental variable estimates (IV), with covariates (6) |
|---|---|---|---|---|---|---|
| A. Men | 3,970 (555) | 3,754 (536) | 1,117 (569) | 970 (546) | 1,825 (928) | 1,593 (895) |
| B. Women | 2,133 (345) | 2,215 (334) | 1,243 (359) | 1,139 (341) | 1,942 (560) | 1,780 (532) |

Notes: Authors’ tabulation of JTPA study data. The table reports OLS, ITT, and IV estimates
of the effect of subsidized training on earnings in the JTPA experiment. Columns 1 and 2
show differences in earnings by training status; columns 3 and 4 show differences by
random-assignment status. Columns 5 and 6 report the result of using random-assignment status as an
instrument for training. The covariates used for columns 2, 4, and 6 are high school or GED,
black, Hispanic, married, worked less than 13 weeks in past year, AFDC (for women), plus
indicators for the JTPA service strategy recommended, age group, and second follow-up survey.
Robust standard errors are shown in parentheses. There are 5,102 men and 6,102 women in
the sample.

Columns 3 and 4 of table 4.4.1 compare individuals accord- 
ing to whether they were offered treatment. In other words, 
this comparison is based on randomly assigned $Z_i$. In the language
of clinical trials, the contrast in columns 3 and 4 is
known as an intention-to-treat (ITT) effect. The ITT effects
in the table are on the order of $1,200 (somewhat less with
covariates). Since $Z_i$ was randomly assigned, ITT effects have
a causal interpretation: they tell us the causal effect of the offer 
of treatment, building in the fact that many of those offered 
have declined to participate. For this reason, the ITT effect is 
too small relative to the average causal effect on those who 
were in fact treated. Columns 5 and 6 put the pieces together 
and give us the most interesting effect: ITT divided by the 
difference in compliance rates between treatment and control 
groups as originally assigned (about .6). These figures, roughly 
$1,800, measure the effect of treatment on the treated. 

How do we know that ITT divided by compliance is the
effect of treatment on the treated? We can recognize ITT as 
the reduced-form effect of the randomly assigned offer of treat- 
ment, our instrument in this case. The compliance rate is the 
first stage associated with this instrument, and the Wald esti- 
mand, as always, is the reduced form divided by the first stage. 
In general, this equals LATE, but because we have (almost) 
no always-takers, the treated population consists (almost) 
entirely of compliers. The IV estimates in columns 5 and 6 
of table 4.4.1 are therefore consistent estimates of the effect of 
treatment on the treated. 
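
As a back-of-the-envelope check on this logic, the sketch below divides the ITT estimates in column 3 of table 4.4.1 by the IV estimates in column 5 to recover the first stage implied by the Wald formula. The helper name `implied_first_stage` is hypothetical, and the calculation uses only the rounded point estimates printed in the table, so the result is approximate.

```python
# Wald estimand = reduced form (ITT) / first stage, so first stage = ITT / IV.
# Inputs are the no-covariate point estimates printed in table 4.4.1.

itt = {"men": 1117, "women": 1243}   # column 3
iv = {"men": 1825, "women": 1942}    # column 5


def implied_first_stage(itt_effect, iv_effect):
    """Back out the compliance-rate difference implied by ITT and IV."""
    return itt_effect / iv_effect


for group in itt:
    fs = implied_first_stage(itt[group], iv[group])
    print(f"{group}: implied first stage = {fs:.2f}")
```

Both implied first stages land near the compliance difference of about .6 quoted in the text, which is what the reduced-form-over-first-stage reading of the table predicts.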

This conclusion is important enough that it warrants an 
alternative derivation. To the best of our knowledge, the first 
person to point out that the IV formula can be used to estimate 
the effect of treatment on the treated in a randomized trial with 
one-sided noncompliance was Howard Bloom (1984). Here is 
Bloom’s result with a simple direct proof. 


Theorem 4.4.2 The Bloom Result. Suppose the assumptions
of the LATE theorem hold, and $E[D_i \mid Z_i = 0] = P[D_i = 1 \mid Z_i = 0] = 0$. Then

$$
\frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{P[D_i = 1 \mid Z_i = 1]} = E[Y_{1i} - Y_{0i} \mid D_i = 1].
$$


Proof. $E[Y_i \mid Z_i = 1] = E[Y_{0i} \mid Z_i = 1] + E[(Y_{1i} - Y_{0i})D_i \mid Z_i = 1]$,
while $E[Y_i \mid Z_i = 0] = E[Y_{0i} \mid Z_i = 0]$ because $Z_i = 0$ implies
$D_i = 0$. Therefore

$$
E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] = E[(Y_{1i} - Y_{0i})D_i \mid Z_i = 1],
$$

since $E[Y_{0i} \mid Z_i = 0] = E[Y_{0i} \mid Z_i = 1]$ by independence. But

$$
E[(Y_{1i} - Y_{0i})D_i \mid Z_i = 1] = E[Y_{1i} - Y_{0i} \mid D_i = 1, Z_i = 1]P[D_i = 1 \mid Z_i = 1],
$$

while $D_i = 1$ implies $Z_i = 1$, since no one with $Z_i = 0$ is treated.

Hence, $E[Y_{1i} - Y_{0i} \mid D_i = 1, Z_i = 1] = E[Y_{1i} - Y_{0i} \mid D_i = 1]$.
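
The Bloom result can also be seen in a small simulation. The sketch below is a hypothetical illustration, not the JTPA analysis: the outcome distribution, the take-up rule, and the effect sizes are invented. Each simulated unit is placed in both arms, a device that mimics exact independence of the instrument so the identity holds up to floating-point error rather than only on average.

```python
import random

# Hypothetical population with one-sided noncompliance: nobody assigned to
# control (Z = 0) is treated, and take-up under Z = 1 is self-selected.
rng = random.Random(0)
units = []
for _ in range(2000):
    y0 = rng.gauss(10.0, 2.0)            # potential outcome without treatment
    effect = 1.0 + 0.5 * rng.random()    # heterogeneous gain Y1 - Y0
    d1 = 1 if y0 + rng.gauss(0.0, 1.0) > 10.0 else 0  # takers have higher Y0
    units.append((y0, effect, d1))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Put every unit in both arms, which mimics exact independence of Z.
y_z1 = [y0 + effect * d1 for y0, effect, d1 in units]  # D = D1 when Z = 1
y_z0 = [y0 for y0, _, _ in units]                      # D = 0 when Z = 0

itt = mean(y_z1) - mean(y_z0)                 # reduced form
compliance = mean(d1 for _, _, d1 in units)   # first stage, P[D = 1 | Z = 1]
wald = itt / compliance

# True effect on the treated, available only because this is a simulation.
att = mean(effect for _, effect, d1 in units if d1 == 1)

# Naive contrast: treated units versus the control arm, biased by selection.
naive = mean(y for y, (_, _, d1) in zip(y_z1, units) if d1 == 1) - mean(y_z0)

print(f"Wald={wald:.3f} ATT={att:.3f} naive={naive:.3f}")
```

In this setup the Wald ratio reproduces the effect on the treated, while the naive treated-versus-control contrast is too large because those who take up treatment have higher untreated outcomes to begin with.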


In addition to telling us how to analyze randomized tri- 
als with noncompliance, the LATE framework opens the 
door to cleverly designed randomized experiments in settings
where it’s impossible or unethical to compel treatment compliance.
A creative example from the field of criminology is
pliance. A creative example from the field of criminology is 
the Minneapolis Domestic Violence Experiment (MDVE). The 
MDVE was a pioneering effort to determine the best police 
response to domestic violence (Sherman and Berk, 1984). 
In general, police use a number of strategies in response to 
a domestic violence call. These include referral to counsel- 
ing, separation orders, and arrest. A vigorous debate swirls 
around the question of whether a hard-line response—arrest 
and at least temporary incarceration—is productive, especially 
in view of the fact that domestic assault charges are frequently 
dropped. 

As a result of this debate, the city of Minneapolis authorized 
a randomized trial where the police response to a domestic 
disturbance call was determined in part by random assign- 
ment. The research design used randomly shuffled color-coded 
report forms telling the responding officers to arrest some 
perpetrators while referring others to counseling or merely sep- 
arating the parties. In practice, however, the police were free 
to overrule the random assignment. For example, an especially 
dangerous or drunk offender was arrested no matter what. As 
a result, the actual response often deviated from the randomly 
assigned response, though the two are highly correlated. 

Most published analyses of the MDVE data recognize this 
compliance problem and focus on ITT effects, that is, they use 
the original random assignment and not the treatment actu- 
ally delivered. But the MDVE data can also be used to get 
the average causal effect on compliers, in this case those who 
were not arrested because they were randomly assigned to be 
treated differently but would have been arrested otherwise. 
The MDVE is analyzed in this spirit in Angrist (2006). Because 
everyone in the MDVE who was assigned to be arrested was 
in fact arrested, there are no never-takers. This is an interest- 
ing twist and the flip side of the Bloom scenario: here we have 
$D_{1i} = 1$ for everybody. Consequently, LATE is the effect of
treatment on the nontreated, that is,

$$
E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}] = E[Y_{1i} - Y_{0i} \mid D_i = 0],
$$

where $D_i$ indicates arrest. The IV estimates using MDVE data
show that the average causal effect of arrest is to reduce repeat
offenses sharply, in this case, among the subpopulation not
arrested.²⁴


4.4.4 Counting and Characterizing Compliers 


We've seen that, except in special cases, each instrumental 
variable identifies a unique causal parameter, one specific to 
the subpopulation of compliers for that instrument. Differ- 
ent valid instruments for the same causal relation therefore 
estimate different things, at least in principle (an important 
exception being instruments that allow for perfect compliance 
on one side or the other). Although different IV estimates are 
implicitly weighted up by 2SLS to produce a single average 
causal effect, overidentification testing of the sort discussed 
in section 4.2.2, where multiple instruments are validated 
according to whether or not they estimate the same thing, is 
out the window in a fully heterogeneous world. 

Differences in compliant subpopulations might explain vari- 
ability in treatment effects from one instrument to another. 
We would therefore like to learn as much as we can about the 
compliers for different instruments. Moreover, if the compli- 
ant subpopulation is similar to other populations of interest, 
the case for extrapolating estimated causal effects to these 
other populations is stronger. In this spirit, Acemoglu and 
Angrist (2000) argue that quarter-of-birth instruments and 
state compulsory attendance laws (specifically, the minimum 
schooling required before leaving school in your state of birth) 
affect essentially the same group of people and for the same 
reasons. We therefore expect IV estimates of the returns to
schooling from quarter-of-birth and compulsory schooling
instruments to be similar. We might also expect the quarter-of-birth
estimates to predict the impact of contemporary
proposals to strengthen compulsory attendance laws.

²⁴ The Krueger (1999) study discussed in chapter 2 also uses IV to analyze
data from a randomized trial. Specifically, this study uses randomly
assigned class size as an instrument for actual class size with data from the
Tennessee STAR experiment. For students in first grade and higher, actual class
size differs from randomly assigned class size because parents and teachers
moved students around in years after the experiment began. Krueger (1999)
also illustrates 2SLS applied to a model with variable treatment intensity, as
discussed in section 4.5.3.

On the other hand, if the compliant subpopulations asso- 
ciated with two or more instruments are very different, yet 
the IV estimates they generate are similar, we might be pre- 
pared to adopt homogeneous effects as a working hypothesis. 
This revives the overidentification idea but puts it at the service
of external validity.²⁵ This reasoning is illustrated in a
study of the effects of family size on children’s education by 
Angrist, Lavy, and Schlosser (2006). The Angrist, Lavy, and 
Schlosser study was motivated by the observation that chil- 
dren from larger families typically end up with less education 
than those from smaller families. A longstanding concern in 
research on fertility is whether the observed negative corre- 
lation between larger families and worse outcomes is causal. 
As it turns out, IV estimates of the effect of family size using a 
number of different instruments, each with very different com- 
pliant subpopulations, all generate results showing no effect 
of family size. Angrist, Lavy, and Schlosser (2006) argue that 
their results point to a common family size effect of zero for 
just about everybody in the Israeli population they studied.²⁶

We have already seen that the size of a complier group is 
easy to measure. This is just the Wald first stage, since, given 
monotonicity, we have 


$$
P[D_{1i} > D_{0i}] = E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0].
$$


²⁵ In fact, maintaining the hypothesis that all instruments in an overidentified
model are valid, the traditional overidentification test statistic becomes a
formal test for treatment effect homogeneity.

²⁶ See also Black, Devereux, and Salvanes (2005) for similar results from
Norway.

We can also tell what proportion of the treated are compliers
since, for compliers, treatment status is completely determined
by $Z_i$. Start with the definition of conditional probability:

$$
\begin{aligned}
P[D_{1i} > D_{0i} \mid D_i = 1]
&= \frac{P[D_i = 1 \mid D_{1i} > D_{0i}]\,P[D_{1i} > D_{0i}]}{P[D_i = 1]} \\
&= \frac{P[Z_i = 1]\,(E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0])}{P[D_i = 1]}.
\end{aligned}
\tag{4.4.7}
$$

The second equality uses the facts that $P[D_i = 1 \mid D_{1i} > D_{0i}] = P[Z_i = 1 \mid D_{1i} > D_{0i}]$
and that $P[Z_i = 1 \mid D_{1i} > D_{0i}] = P[Z_i = 1]$
by independence. In other words, the proportion of the treated 
who are compliers is given by the first stage, times the proba- 
bility the instrument is switched on, divided by the proportion 
treated. 

Formula (4.4.7) is illustrated here by calculating the pro- 
portion of veterans who are draft lottery compliers. The 
ingredients are reported in the first two rows of table 4.4.2. For 
example, for white men born in 1950, the first stage is .159, the 
probability of draft eligibility is .534, and the marginal probability
of treatment is .267. From these statistics, we compute
that the compliant subpopulation is .32 of the veteran pop- 
ulation in this group. The proportion of veterans who were 
draft lottery compliers falls to 20 percent for nonwhite men 
born in 1950. This is not surprising, since the draft lottery first 
stage is considerably weaker for nonwhites. The last column 
of the table reports the proportion of nonveterans who would 
have served if they had been draft eligible. This ranges from 
about 3 percent of nonwhites to 10 percent of whites, reflect- 
ing the fact that most nonveterans were deferred, ineligible, or 
unqualified for military service. 
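
The draft lottery entries can be reproduced directly from formula (4.4.7). The sketch below is a minimal illustration using the rounded numbers printed in table 4.4.2; the function names are hypothetical. The expression for the nontreated is not written out in the text and is included here as the analogue of (4.4.7), on the assumption that untreated compliers are exactly those with the instrument switched off.

```python
# Relative size of the complier group, following (4.4.7).
# Inputs are the rounded draft-lottery entries printed in table 4.4.2.

def complier_share_of_treated(first_stage, p_z, p_d):
    """P[D1 > D0 | D = 1] = P[Z = 1] * first stage / P[D = 1]."""
    return p_z * first_stage / p_d


def complier_share_of_untreated(first_stage, p_z, p_d):
    """Analogue for the nontreated: compliers with Z = 0 over P[D = 0]."""
    return (1 - p_z) * first_stage / (1 - p_d)


rows = {
    "white men born in 1950": {"first_stage": 0.159, "p_z": 0.534, "p_d": 0.267},
    "nonwhite men born in 1950": {"first_stage": 0.060, "p_z": 0.534, "p_d": 0.163},
}

for label, r in rows.items():
    treated = complier_share_of_treated(**r)
    untreated = complier_share_of_untreated(**r)
    print(f"{label}: {treated:.3f} of veterans, {untreated:.3f} of nonveterans")
```

Up to rounding, these match columns 8 and 9 of the table for the two draft lottery rows.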

The effect of compulsory military service is the parameter 
of primary interest in the Angrist (1990) study, so the fact 
that draft eligibility compliers are a minority of veterans is 
not really a limitation of this study. Even in the Vietnam 
era, most soldiers were volunteers, a little appreciated fact 
about Vietnam-era veterans. The LATE interpretation of IV 
estimates using the draft lottery highlights the fact that other 
identification strategies are needed to estimate the effects of 
military service on volunteers (some of these are implemented 
in Angrist, 1998). 


TABLE 4.4.2
Probabilities of compliance in instrumental variables studies

| Source (1) | Endogenous variable (D) (2) | Instrument (Z) (3) | Sample (4) | $P[D = 1]$ (5) | First stage, $P[D_1 > D_0]$ (6) | $P[Z = 1]$ (7) | Compliance probability $P[D_1 > D_0 \mid D = 1]$ (8) | Compliance probability $P[D_1 > D_0 \mid D = 0]$ (9) |
|---|---|---|---|---|---|---|---|---|
| Angrist (1990) | Veteran status | Draft eligibility | White men born in 1950 | .267 | .159 | .534 | .318 | .101 |
| | | | Non-white men born in 1950 | .163 | .060 | .534 | .197 | .033 |
| Angrist and Evans (1998) | More than two children | Twins at second birth | Married women aged 21-35 with two or more children in 1980 | .381 | .603 | .008 | .013 | .966 |
| | | First two children are same sex | | .381 | .060 | .506 | .080 | .048 |
| Angrist and Krueger (1991) | High school graduate | Third- or fourth-quarter birth | Men born between 1930 and 1939 | .770 | .016 | .509 | .011 | .034 |
| Acemoglu and Angrist (2000) | High school graduate | State requires 11 or more years of school attendance | White men aged 40-49 | .617 | .037 | .300 | .018 | .068 |

Notes: The table computes the absolute and relative size of the complier population for a number of instrumental variables. The first
stage, reported in column 6, gives the absolute size of the complier group. Columns 8 and 9 show the size of the complier population
relative to the treated and untreated populations.

The remaining rows in table 4.4.2 show the size of the compliant
subpopulation for the twins and sibling sex composition
instruments used by Angrist and Evans (1998) to estimate 
the effects of childbearing and for the quarter-of-birth instru- 
ments and compulsory attendance laws used by Angrist and 
Krueger (1991) and Acemoglu and Angrist (2000) to estimate 
the returns to schooling. In each of these studies, the com- 
pliant subpopulation is a small fraction of the treated group. 
For example, less than 2 percent of those who graduated from 
high school did so because of compulsory attendance laws or 
by virtue of having been born in a late quarter. 

The question of whether a small compliant subpopulation 
is a cause for worry is context-specific. In some cases, it seems 
fair to say, “you get what you need.” With many policy interventions,
for example, it is a marginal group that is of primary
interest, a point emphasized in McClellan, McNeil, and New- 
house’s (1994) landmark IV study of the effects of surgery 
on heart attack patients. They used the relative distance to 
cardiac care facilities to construct instruments for whether an 
elderly heart attack patient was treated with a surgical inter- 
vention. Most patients get the same treatment either way, but 
for some, the proper course of action (or at least the received 
wisdom as to the proper course of action) is uncertain. In such 
cases, health care providers or patients opt for a more inva- 
sive strategy only if a well-equipped surgical facility is close 
by. McClellan et al. found little benefit from surgical proce- 
dures for this marginal group. Similarly, an increase in the 
compulsory attendance cut-off to age 18 is clearly irrelevant 
for the majority of American high school students, but affects 
some who would otherwise drop out. IV estimates suggest 
the economic returns to schooling for this marginal group are 
substantial. 

The last column of table 4.4.2 illustrates the special feature 
of twins instruments alluded to at the end of section 4.4.2. As 
before, let $D_i = 0$ for women with two children in a sample of
women with at least two children and 1 for women who have 
more than two. Because there are no never-takers in response 
to the event of a multiple birth—all mothers who have twins 
at second birth end up with (at least) three children—the 
probability of compliance among those with $D_i = 0$ is virtually
one (the table shows an entry of .97). LATE is therefore the
effect on the nontreated, $E[Y_{1i} - Y_{0i} \mid D_i = 0]$, in this case.

Unlike the size of the complier group, information on the 
characteristics of compliers seems like a tall order because the 
compliers cannot be individually identified. Because we can’t 
see both $D_{1i}$ and $D_{0i}$ for each individual, we can’t just list those
with $D_{1i} > D_{0i}$ and then calculate the distribution of characteristics
for this group. Nevertheless, in spite of the fact that
compliers cannot be listed or named, it’s easy to describe the 
distribution of complier characteristics. To simplify, we focus 
here on characteristics, such as race or degree completion, that 
can be described by dummy variables. In this case, everything 
we need to know can be learned from variation in the first 
stage across covariate groups. 

Let $x_{1i}$ be a Bernoulli-distributed characteristic, say a
dummy indicating college graduates. Are sex composition 
compliers more or less likely to be college graduates than other 
women with two children? This question is answered by the 
following calculation: 


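$$
\begin{aligned}
\frac{P[x_{1i} = 1 \mid D_{1i} > D_{0i}]}{P[x_{1i} = 1]}
&= \frac{P[D_{1i} > D_{0i} \mid x_{1i} = 1]}{P[D_{1i} > D_{0i}]} \\
&= \frac{E[D_i \mid Z_i = 1, x_{1i} = 1] - E[D_i \mid Z_i = 0, x_{1i} = 1]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]}.
\end{aligned}
\tag{4.4.8}
$$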


In other words, the relative likelihood a complier is a col- 
lege graduate is given by the ratio of the first stage for college 
graduates to the overall first stage.²⁷
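
Formula (4.4.8) can be verified on a toy population in which complier status is known by construction. The sketch below is hypothetical throughout: the type shares within each covariate group are invented and are not the Angrist-Evans numbers. It assumes the instrument is independent of both type and the covariate, so the overall first stage is the share-weighted average of the group first stages.

```python
# Hypothetical population: x = 1 marks college graduates. Within each x
# group we posit shares of always-takers, compliers, and never-takers.
# None of these numbers come from the Angrist-Evans data.

p_x1 = 0.30
type_shares = {
    1: {"always": 0.30, "complier": 0.20, "never": 0.50},
    0: {"always": 0.40, "complier": 0.05, "never": 0.55},
}


def first_stage(shares):
    """E[D | Z = 1] - E[D | Z = 0] for a group with the given type shares."""
    treated_if_on = shares["always"] + shares["complier"]   # E[D | Z = 1]
    treated_if_off = shares["always"]                       # E[D | Z = 0]
    return treated_if_on - treated_if_off


fs_x1 = first_stage(type_shares[1])
fs_x0 = first_stage(type_shares[0])
fs_all = p_x1 * fs_x1 + (1 - p_x1) * fs_x0   # overall first stage

# Right-hand side of (4.4.8): ratio of first stages.
ratio_from_first_stages = fs_x1 / fs_all

# Left-hand side, computed directly from the (normally unobserved) types.
p_x1_given_complier = p_x1 * type_shares[1]["complier"] / (
    p_x1 * type_shares[1]["complier"] + (1 - p_x1) * type_shares[0]["complier"]
)
ratio_direct = p_x1_given_complier / p_x1

print(f"first-stage ratio={ratio_from_first_stages:.3f} direct={ratio_direct:.3f}")
```

The two ratios agree, which is the point of (4.4.8): the relative likelihood that a complier has the characteristic can be read off observable first stages even though individual compliers cannot be listed.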


²⁷ A general method for constructing the mean or other features of the distribution
of covariates for compliers uses Abadie’s (2003) kappa-weighting
scheme. For example,

$$
E[X_i \mid D_{1i} > D_{0i}] = \frac{E[\kappa_i X_i]}{E[\kappa_i]},
$$

where

$$
\kappa_i = 1 - \frac{D_i(1 - Z_i)}{1 - P(Z_i = 1 \mid X_i)} - \frac{(1 - D_i)Z_i}{P(Z_i = 1 \mid X_i)}.
$$

This works because the weighting function, $\kappa_i$, “finds compliers,” in a sense
discussed in section 4.5.2.

TABLE 4.4.3
Complier characteristics ratios for twins and sex composition instruments

| Variable | $P[x_{1i} = 1]$ (1) | Twins at second birth: $P[x_{1i} = 1 \mid D_{1i} > D_{0i}]$ (2) | Twins at second birth: $P[x_{1i} = 1 \mid D_{1i} > D_{0i}] / P[x_{1i} = 1]$ (3) | First two children are same sex: $P[x_{1i} = 1 \mid D_{1i} > D_{0i}]$ (4) | First two children are same sex: $P[x_{1i} = 1 \mid D_{1i} > D_{0i}] / P[x_{1i} = 1]$ (5) |
|---|---|---|---|---|---|
| Age 30 or older at first birth | .0029 | .004 | 1.39 | .0023 | .995 |
| Black or Hispanic | .125 | .103 | .822 | .102 | .814 |
| High school graduate | .822 | .861 | 1.048 | .815 | .998 |
| College graduate | .132 | .151 | 1.14 | .0904 | .704 |

Notes: The table reports an analysis of complier characteristics for twins and sex composition
instruments. The ratios in columns 3 and 5 give the relative likelihood that compliers
have the characteristic indicated at left. Data are from the 1980 census 5 percent sample,
including married mothers aged 21-35 with at least two children, as in Angrist and Evans
(1998). The sample size is 254,654 for all columns.


This calculation is illustrated in table 4.4.3, which reports 
compliers’ characteristics ratios for age at first birth, non- 
white race, and degree completion using twins and same-sex 
instruments. The table was constructed from the Angrist and 
Evans (1998) extract from the 1980 census containing mar- 
ried women aged 21-35 with at least two children. Twins 
compliers are much more likely to be over 30 than the aver- 
age mother in the sample, reflecting the fact that younger 
women who had a multiple birth were more likely to go 
on to have additional children anyway (though over-30 first 
births are rare for all women in the Angrist-Evans sample). 
Twins compliers are also more educated than the average 
mother, while sex composition compliers are less educated. 
This helps to explain the smaller 2SLS estimates generated by 
twins instruments (reported here in table 4.1.4), since Angrist 
and Evans (1998) show that the labor supply consequences of 
childbearing decline with mother’s schooling. 


4.5 Generalizing LATE


The LATE theorem applies to a stripped-down causal model 
in which a single dummy instrument is used to estimate the 
impact of a dummy treatment with no covariates. We can gen- 
eralize this in three important ways: multiple instruments (e.g., 
a set of quarter-of-birth dummies), models with covariates 
(e.g., controls for year of birth), and models with variable and 
continuous treatment intensity (e.g., years of schooling). In 
all three cases, the IV estimand is a weighted average of causal 
effects for instrument-specific compliers. The econometric tool 
remains 2SLS and the interpretation remains fundamentally 
similar to the basic LATE result, with a few bells and whis- 
tles. 2SLS with multiple instruments produces a causal effect 
that averages IV estimands using the instruments one at a time; 
2SLS with covariates produces an average of covariate-specific 
LATEs; 2SLS with variable or continuous treatment intensity 
produces a weighted average derivative along the length of a 
possibly nonlinear causal response function. These results provide
a simple causal interpretation for 2SLS in most empirically
relevant settings. 


4.5.1 LATE with Multiple Instruments 


The multiple-instruments extension is easy to see. This is essentially
the same as a result we discussed in the grouped data
context. Consider a pair of dummy instruments, $Z_{1i}$ and $Z_{2i}$.
Without loss of generality, assume these dummies are mutually
exclusive (if not, then we can work with a mutually exclusive
set of three dummies, $Z_{1i}(1 - Z_{2i})$, $Z_{2i}(1 - Z_{1i})$, and $Z_{1i}Z_{2i}$). The
two dummies can be used to construct Wald estimators. Again
without loss of generality, assume monotonicity is satisfied
for each with a positive first stage (if not, we can recode the
dummies so this is true). Both therefore estimate a version of
$E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}]$, though the population with $D_{1i} > D_{0i}$
differs for $Z_{1i}$ and $Z_{2i}$.

Instead of Wald estimators, we can use $Z_{1i}$ and $Z_{2i}$ together
in a 2SLS procedure. Since these two dummies and a constant
exhaust the information in the instrument set, this 2SLS procedure
is the same as grouped data estimation using conditional
means defined given $Z_{1i}$ and $Z_{2i}$ (whether or not the instruments
are correlated). As in Angrist (1991), the resulting
grouped data estimator is a linear combination of the underly- 
ing Wald estimators. In other words, it is a linear combination 
of the instrument-specific LATEs using the instruments one 
at a time (in fact, it is the efficient linear combination in a 
traditional homoskedastic linear constant effects model). 

This argument is not quite complete, since we haven’t shown 
that the linear combination of LATEs produced by 2SLS is also 
a weighted average (i.e., the weights are non-negative and sum 
to one). The relevant weighting formulas appear in Imbens and 
Angrist (1994) and Angrist and Imbens (1995). The general 
formulas are a little messy, so here we lay out a simple version 
based on the two-instrument example. The example shows 
that 2SLS using $Z_{1i}$ and $Z_{2i}$ together is a weighted average of
IV estimates using $Z_{1i}$ and $Z_{2i}$ one at a time. Let

$$
\rho_j = \frac{\mathrm{Cov}(Y_i, Z_{ji})}{\mathrm{Cov}(D_i, Z_{ji})}, \quad j = 1, 2,
$$

denote the two IV estimands using $Z_{1i}$ and $Z_{2i}$.

The (population) first-stage fitted values for 2SLS are
$\hat{D}_i = \pi_{11}Z_{1i} + \pi_{12}Z_{2i}$, where $\pi_{11}$ and $\pi_{12}$ are
positive numbers. By virtue of the IV interpretation of 2SLS,
the 2SLS estimand is

$$
\begin{aligned}
\rho_{2SLS} &= \frac{\mathrm{Cov}(Y_i, \hat{D}_i)}{\mathrm{Cov}(D_i, \hat{D}_i)}
 = \frac{\pi_{11}\mathrm{Cov}(Y_i, Z_{1i})}{\mathrm{Cov}(D_i, \hat{D}_i)}
 + \frac{\pi_{12}\mathrm{Cov}(Y_i, Z_{2i})}{\mathrm{Cov}(D_i, \hat{D}_i)} \\
&= \left[\frac{\pi_{11}\mathrm{Cov}(D_i, Z_{1i})}{\mathrm{Cov}(D_i, \hat{D}_i)}\right]
   \left[\frac{\mathrm{Cov}(Y_i, Z_{1i})}{\mathrm{Cov}(D_i, Z_{1i})}\right]
 + \left[\frac{\pi_{12}\mathrm{Cov}(D_i, Z_{2i})}{\mathrm{Cov}(D_i, \hat{D}_i)}\right]
   \left[\frac{\mathrm{Cov}(Y_i, Z_{2i})}{\mathrm{Cov}(D_i, Z_{2i})}\right] \\
&= \psi\rho_1 + (1 - \psi)\rho_2,
\end{aligned}
$$

where $\rho_1$ is LATE using $Z_{1i}$ and $\rho_2$ is LATE using $Z_{2i}$, and

$$\psi = \frac{\pi_{11}\mathrm{Cov}(D_i, Z_{1i})}{\pi_{11}\mathrm{Cov}(D_i, Z_{1i}) + \pi_{12}\mathrm{Cov}(D_i, Z_{2i})}$$

is a number between zero and one that depends on the relative
strength of each instrument in the first stage. Thus, we
have shown that 2SLS is a weighted average of causal effects
for instrument-specific compliant subpopulations. Suppose,
for example, that $Z_{1i}$ denotes twin births and $Z_{2i}$ indicates
(nontwin) same-sex sibships in families with two or more
children, both instruments for family size, as in Angrist and Evans
(1998). A multiple second birth increases the likelihood of
having a third child by about .6, while a same-sex sibling
pair increases the likelihood of a third birth by about .07.
When these two instruments are used together, the resulting
2SLS estimates are a weighted average of the Wald estimates
produced by using the instruments one at a time.^28
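The decomposition above is an algebraic identity, so it can be checked on simulated data. The following is a minimal sketch in plain Python, not the authors' code: the data-generating process, the seed, and all function names are assumptions made for illustration. The design is chosen so that each instrument used alone has a positive first stage, which the claim that $\psi$ lies between zero and one relies on. The last lines redo the arithmetic in the footnote on the twins and same-sex estimates.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance; the 1/n scaling cancels in every ratio below."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (w - mb) for x, w in zip(a, b)) / len(a)


def simulate(n=50_000, seed=1):
    """Hypothetical design: z1 shifts treatment a lot, z2 somewhat less."""
    rng = random.Random(seed)
    y, d, z1, z2 = [], [], [], []
    for _ in range(n):
        g = rng.random()
        a = 1.0 if g < 0.2 else 0.0            # first instrument switched on
        b = 1.0 if 0.2 <= g < 0.4 else 0.0     # second instrument, mutually exclusive
        u = rng.random()                        # latent resistance to treatment
        di = 1.0 if u < 0.2 + 0.5 * a + 0.3 * b else 0.0   # monotone first stage
        y0 = u + rng.gauss(0.0, 1.0)            # y0 depends on u, so d is endogenous
        y.append(y0 + (1.0 + 2.0 * u) * di)     # heterogeneous causal effect
        d.append(di)
        z1.append(a)
        z2.append(b)
    return y, d, z1, z2


def decompose(y, d, z1, z2):
    # One-at-a-time IV estimands, rho_j = Cov(Y, Z_j) / Cov(D, Z_j).
    rho1 = cov(y, z1) / cov(d, z1)
    rho2 = cov(y, z2) / cov(d, z2)
    # OLS first stage of d on (1, z1, z2): solve the 2x2 normal equations.
    s11, s22, s12 = cov(z1, z1), cov(z2, z2), cov(z1, z2)
    c1, c2 = cov(d, z1), cov(d, z2)
    det = s11 * s22 - s12 * s12
    pi11 = (c1 * s22 - c2 * s12) / det
    pi12 = (c2 * s11 - c1 * s12) / det
    # Fitted values; the intercept is dropped because it does not affect covariances.
    d_hat = [pi11 * a + pi12 * b for a, b in zip(z1, z2)]
    rho_2sls = cov(y, d_hat) / cov(d_hat, d_hat)   # second-stage OLS slope
    psi = pi11 * c1 / (pi11 * c1 + pi12 * c2)
    return rho1, rho2, rho_2sls, psi


y, d, z1, z2 = simulate()
rho1, rho2, rho_2sls, psi = decompose(y, d, z1, z2)
weighted = psi * rho1 + (1.0 - psi) * rho2
print(f"rho1={rho1:.3f} rho2={rho2:.3f} psi={psi:.3f}")
print(f"2SLS={rho_2sls:.6f} weighted average={weighted:.6f}")

# The footnote's numbers: weights .74 and .26 on estimates -.084 and -.138.
footnote_check = 0.74 * (-0.084) + 0.26 * (-0.138)
print(f"footnote check: {footnote_check:.3f}")
```

The two printed estimates agree up to floating-point error because the second-stage slope and the weighted average are the same function of the sample covariances once the first stage is fit by OLS.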


4.5.2 Covariates in the Heterogeneous Effects Model 


You might be wondering where the covariates have gone. After 
all, covariates played a starring role in our earlier discussion 
of regression and matching. Yet the LATE theorem does not 
involve covariates. This stems from the fact that when we see 
instrumental variables as a type of (natural or man-made) ran- 
domized trial, covariates take a back seat. If the instrument is 
randomly assigned, it is likely to be independent of covari- 
ates. Not all instruments have this property, however. As with 
covariates in the regression models in the previous chapter, 
the main reason why covariates are included in causal analyses 
using instrumental variables is that the conditional indepen- 
dence and exclusion restrictions underlying IV estimation may 
be more likely to be valid after conditioning on covariates. 
Even randomly assigned instruments, like draft eligibility sta- 
tus, may be valid only after conditioning on covariates. In the 
case of draft eligibility, older cohorts were more likely to be 
draft eligible because the draft eligibility cutoffs were higher. 
Because there are year-of-birth (or age) differences in earnings,
draft eligibility is a valid instrument only after conditioning on
year of birth.


^28 Using twins instruments alone, the IV estimate of the effect of a third
child on female labor force participation is -.084. The corresponding
same-sex estimate is -.138. Using both instruments produces a 2SLS estimate of
-.098. The 2SLS weight in this case is .74 for twins, .26 for same sex, due to
the much stronger twins first stage.

More formally, IV estimation with covariates may be 
justified by a conditional independence assumption 


$$\{Y_{1i}, Y_{0i}, D_{1i}, D_{0i}\} \perp\!\!\!\perp Z_i \mid X_i. \qquad (4.5.1)$$


In other words, we think of the instrumental variables as being
“as good as randomly assigned,” conditional on covariates $X_i$
(here we are implicitly maintaining the exclusion restriction 
as well). A second reason for incorporating covariates is that 
conditioning on covariates may reduce some of the variability 
in the dependent variable. This can lead to more precise 2SLS 
estimates. 

A benchmark constant effects model with covariates imposes 
functional form restrictions as follows: 


$$E[Y_{0i} \mid X_i] = X_i'\alpha^* \text{ for a } K \times 1 \text{ vector of coefficients, } \alpha^*;$$

$$Y_{1i} - Y_{0i} = \rho.$$

In combination with (4.5.1), this motivates 2SLS estimation
of an equation like (4.1.6), as discussed in section 4.1.


A straightforward generalization of the constant effects 
model allows 


$$Y_{1i} - Y_{0i} = \rho(X_i),$$


where $\rho(X_i)$ is a deterministic function of $X_i$. This model can
be estimated by adding interactions between $Z_i$ and $X_i$ to the
first stage and (the same) interactions between $D_i$ and $X_i$ to the
second stage. There are now multiple endogenous variables
and hence multiple first-stage equations. These can be written


$$D_i = X_i'\pi_{00} + \pi_{01}Z_i + Z_iX_i'\pi_{02} + \xi_{0i} \qquad (4.5.2a)$$
$$D_iX_i = X_i'\pi_{10} + \pi_{11}Z_i + Z_iX_i'\pi_{12} + \xi_{1i}. \qquad (4.5.2b)$$


Although (4.5.2b) is written as if $D_iX_i$ is a scalar, there should
be a first stage like this for each element of $D_iX_i$. The
second-stage equation in this case is


$$Y_i = \alpha'X_i + \rho_0D_i + D_iX_i'\rho_1 + \eta_i,$$


so $\rho(X_i) = \rho_0 + \rho_1'X_i$. Alternatively, a nonparametric version
of $\rho(X_i)$ can be estimated by 2SLS in subsamples stratified
on $X_i$.

The heterogeneous effects model underlying the LATE 
theorem allows for identification based on conditional inde- 
pendence as in (4.5.1), though the interpretation is a little more 
complicated than for LATE without covariates. For each value 
of $X_i$, we define covariate-specific LATE,


$$\lambda(X_i) \equiv E[Y_{1i} - Y_{0i} \mid X_i, D_{1i} > D_{0i}].$$


The “saturate and weight” approach to estimation with
covariates, which generates a weighted average of $\lambda(X_i)$, is
spelled out in the following theorem (from Angrist and Imbens, 
1995). 


Theorem 4.5.1 Saturate and Weight. Suppose the assumptions
of the LATE theorem hold conditional on $X_i$. That is,
(CA1, Independence) $\{Y_i(D_{1i}, 1), Y_i(D_{0i}, 0), D_{1i}, D_{0i}\} \perp\!\!\!\perp Z_i \mid X_i$;
(CA2, Exclusion) $P[Y_i(d, 0) = Y_i(d, 1) \mid X_i] = 1$ for $d = 0, 1$;
(CA3, First Stage) $E[D_{1i} - D_{0i} \mid X_i] \neq 0$.
We also assume monotonicity (A4) holds as before. Consider
the 2SLS estimand based on the first-stage equation


$$D_i = \pi_X + \pi_{1X}Z_i + \xi_{1i} \qquad (4.5.3)$$

and the second-stage equation

$$Y_i = \alpha_X + \rho_cD_i + \eta_i,$$


where $\pi_X$ and $\alpha_X$ denote saturated models for covariates (a
full set of dummies for all values of $X_i$) and $\pi_{1X}$ denotes
a first-stage effect of $Z_i$ for every value of $X_i$. Then
$\rho_c = E[\omega(X_i)\lambda(X_i)]$, where

$$\omega(X_i) = \frac{V\{E[D_i \mid X_i, Z_i] \mid X_i\}}{E[V\{E[D_i \mid X_i, Z_i] \mid X_i\}]} \qquad (4.5.4)$$

and

$$V\{E[D_i \mid X_i, Z_i] \mid X_i\} = E\{E[D_i \mid X_i, Z_i](E[D_i \mid X_i, Z_i] - E[D_i \mid X_i]) \mid X_i\}.$$

This theorem says that 2SLS with a fully saturated first stage
and a saturated model for covariates in the second stage
produces a weighted average of covariate-specific LATEs. The
weights are proportional to the average conditional variance
of the population first-stage fitted value, $E[D_i \mid X_i, Z_i]$, at each
value of $X_i$.^29 The theorem comes from the fact that the first
stage coincides with $E[D_i \mid X_i, Z_i]$ when (4.5.3) is saturated (i.e.,
the first-stage regression recovers the CEF).
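With a discrete covariate, the saturate-and-weight result also holds exactly in the sample, which makes it easy to check by hand. The sketch below is an illustration under assumed ingredients (a hypothetical three-valued covariate, made-up first-stage and treatment-effect sizes, and invented function names). It computes the Wald estimate within each covariate cell, weights each one by the cell's share times the variance of the first-stage fitted values within the cell, and compares the result with 2SLS computed from the saturated first and second stages.

```python
import random
from collections import defaultdict


def simulate(n=30_000, seed=2):
    """Hypothetical design with a three-valued covariate x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randrange(3)
        z = 1 if rng.random() < (0.3, 0.5, 0.7)[x] else 0   # P(Z=1|X) varies with x
        u = rng.random()
        d = 1 if u < 0.1 + (0.2, 0.4, 0.6)[x] * z else 0     # first stage varies with x
        y = u + (1.0 + x) * d + rng.gauss(0.0, 1.0)          # effect varies with x
        rows.append((x, z, d, y))
    return rows


def saturate_and_weight(rows):
    # Cell totals by (x, z): [count, sum of d, sum of y].
    cell = defaultdict(lambda: [0, 0.0, 0.0])
    for x, z, d, y in rows:
        c = cell[(x, z)]
        c[0] += 1
        c[1] += d
        c[2] += y
    n = len(rows)
    xs = sorted({x for x, _ in cell})
    wald, var_fit, share = {}, {}, {}
    for x in xs:
        n0, d0, y0 = cell[(x, 0)]
        n1, d1, y1 = cell[(x, 1)]
        p = n1 / (n0 + n1)                       # sample P(Z=1 | X=x)
        first = d1 / n1 - d0 / n0                # covariate-specific first stage
        wald[x] = (y1 / n1 - y0 / n0) / first    # covariate-specific Wald estimate
        var_fit[x] = p * (1.0 - p) * first ** 2  # sample V{E[D|X,Z] | X=x}
        share[x] = (n0 + n1) / n
    total = sum(share[x] * var_fit[x] for x in xs)
    weighted = sum(share[x] * var_fit[x] / total * wald[x] for x in xs)

    # Saturated 2SLS by hand: fitted values are the (x, z) cell means of d.
    # Partialling out the x dummies means demeaning the fitted values within x.
    num = den = 0.0
    for x, z, d, y in rows:
        n0, d0, _ = cell[(x, 0)]
        n1, d1, _ = cell[(x, 1)]
        fitted = d1 / n1 if z else d0 / n0
        dev = fitted - (d0 + d1) / (n0 + n1)
        num += dev * y
        den += dev * dev
    return weighted, num / den, wald


rows = simulate()
weighted, tsls, wald_by_x = saturate_and_weight(rows)
print({x: round(w, 3) for x, w in wald_by_x.items()})
print(f"weighted average={weighted:.6f} saturated 2SLS={tsls:.6f}")
```

Cells where the instrument moves treatment more get more weight, which is the point of the footnote on the weighting formula.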

In practice, we may not want to work with a model with 
a first-stage parameter for each value of the covariates. First, 
there is the risk of bias, as we discuss at the end of this chapter, 
and second, a big pile of individually imprecise first-stage esti- 
mates is not pretty to look at. It seems reasonable to imagine 
that models with fewer parameters, say a restricted first stage 
imposing a constant $\pi_{1X}$, nevertheless approximate some kind
of covariate-averaged LATE. This turns out to be true, but the 
argument is surprisingly indirect. The vision of 2SLS as provid- 
ing a MMSE approximation to an underlying causal relation 
was developed by Abadie (2003). 

The Abadie approach begins by defining the object of interest
to be $E[Y_i \mid D_i, X_i, D_{1i} > D_{0i}]$, the CEF for $Y_i$ given treatment
status and covariates, for compliers. An important feature of
this CEF is that when the conditions of the LATE theorem
hold conditional on $X_i$, it has a causal interpretation. In other
words, for compliers, treatment-control contrasts conditional
on $X_i$ are equal to conditional-on-$X_i$ LATEs:


$$
\begin{aligned}
E[Y_i \mid D_i = 1, X_i, D_{1i} > D_{0i}] &- E[Y_i \mid D_i = 0, X_i, D_{1i} > D_{0i}] \\
&= E[Y_{1i} - Y_{0i} \mid X_i, D_{1i} > D_{0i}] \\
&= \lambda(X_i).
\end{aligned}
$$


This follows immediately from the facts that $D_i = Z_i$ for
compliers and, given (4.5.1), potential outcomes are independent
of $Z_i$ given $X_i$ and $D_{1i} > D_{0i}$. The upshot is that a
regression of $Y_i$ on $D_i$ and $X_i$ in the complier population also has
a causal interpretation. Although this regression might not
give us the CEF of interest (unless it is linear or the model
is saturated), it will, as always, provide the MMSE
approximation to it. That is, a regression of $Y_i$ on $D_i$ and $X_i$ in the
complier population approximates $E[Y_i \mid D_i, X_i, D_{1i} > D_{0i}]$, just
as OLS approximates $E[Y_i \mid D_i, X_i]$. Alas, we do not know who
the compliers are, so we cannot sample them. Nevertheless,
they can be found, in the following sense.


^29 Note that the variability in $E[D_i \mid X_i, Z_i]$ conditional on $X_i$ comes from $Z_i$.
So the weighting formula gives more weight to covariate values where the
instrument creates more variation in fitted values.


Theorem 4.5.2 Abadie Kappa. Suppose the assumptions of
the LATE theorem hold conditional on covariates, $X_i$. Let
$g(Y_i, D_i, X_i)$ be any measurable function of $(Y_i, D_i, X_i)$ with finite
expectation. Define


$$\kappa_i = 1 - \frac{D_i(1 - Z_i)}{1 - P(Z_i = 1 \mid X_i)} - \frac{(1 - D_i)Z_i}{P(Z_i = 1 \mid X_i)}.$$


Then

$$E[g(Y_i, D_i, X_i) \mid D_{1i} > D_{0i}] = \frac{E[\kappa_i g(Y_i, D_i, X_i)]}{E[\kappa_i]}.$$

This can be proved by direct calculation using the fact that,
given the assumptions of the LATE theorem, any expectation
is a weighted average of means for always-takers, never-takers,
and compliers. By monotonicity, those with $D_i(1 - Z_i) = 1$
are always-takers because they have $D_{0i} = 1$, while those
with $(1 - D_i)Z_i = 1$ are never-takers because they have $D_{1i} = 0$.
Hence, the compliers are the left-out group.
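A simulation makes the sense in which compliers "can be found" concrete. In the sketch below (a hypothetical design; the type shares, the values of $P(Z_i = 1 \mid X_i)$, and all names are assumptions), each observation's compliance type is known to the simulator but never used in the kappa calculation. Kappa weighting of observables should then roughly reproduce the complier share and the complier mean of a covariate.

```python
import random

P_Z = {0: 0.3, 1: 0.6}   # P(Z=1 | X=x), treated as known (hypothetical values)


def simulate(n=100_000, seed=3):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        z = 1 if rng.random() < P_Z[x] else 0
        r = rng.random()
        share_c = 0.4 + 0.3 * x          # compliers are more common when x = 1
        if r < share_c:
            kind, d = "complier", z      # compliers take treatment iff z = 1
        elif r < share_c + 0.2:
            kind, d = "always", 1        # always-takers
        else:
            kind, d = "never", 0         # never-takers
        data.append((x, z, d, kind))
    return data


def kappa(d, z, pz):
    """Abadie's kappa for one observation, given pz = P(Z=1 | X)."""
    return 1.0 - d * (1 - z) / (1.0 - pz) - (1 - d) * z / pz


data = simulate()
n = len(data)
k = [kappa(d, z, P_Z[x]) for x, z, d, _ in data]

# Kappa-weighted quantities use only (x, z, d), never the latent type.
kappa_share = sum(k) / n
kappa_mean_x = sum(ki * row[0] for ki, row in zip(k, data)) / sum(k)

# Infeasible benchmarks that peek at the latent type.
complier_x = [x for x, _, _, kind in data if kind == "complier"]
true_share = len(complier_x) / n
true_mean_x = sum(complier_x) / len(complier_x)

print(f"complier share: kappa={kappa_share:.3f} truth={true_share:.3f}")
print(f"complier mean of x: kappa={kappa_mean_x:.3f} truth={true_mean_x:.3f}")
```

The agreement is approximate rather than exact, since kappa averages to zero for always-takers and never-takers only in expectation.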

The Abadie theorem has a number of important implications;
for example, it crops up again in the discussion of
quantile treatment effects. Here, we use it to approximate
$E[Y_i \mid D_i, X_i, D_{1i} > D_{0i}]$ by linear regression. Specifically, let $\alpha_c$
and $\beta_c$ solve


$$(\alpha_c, \beta_c) = \arg\min_{a,b} E\{(E[Y_i \mid D_i, X_i, D_{1i} > D_{0i}] - aD_i - X_i'b)^2 \mid D_{1i} > D_{0i}\}.$$

In other words, $\alpha_cD_i + X_i'\beta_c$ gives the MMSE approximation
to $E[Y_i \mid D_i, X_i, D_{1i} > D_{0i}]$, or fits it exactly if it’s linear. A
consequence of Abadie’s theorem is that this approximating
function can be obtained by solving


$$(\alpha_c, \beta_c) = \arg\min_{a,b} E\{\kappa_i(Y_i - aD_i - X_i'b)^2\}, \qquad (4.5.5)$$


the kappa-weighted least squares minimand.^30

Abadie proposes an estimation strategy (and develops
distribution theory) for a procedure that involves first-step
estimation of $\kappa_i$ using parametric or semiparametric models for
$P(Z_i = 1 \mid X_i)$. The estimates from the first step are then plugged
into the sample analog of (4.5.5) in the second step. Not
surprisingly, when the only covariate is a constant, Abadie’s
procedure simplifies to the Wald estimator. More surprisingly,
minimization of (4.5.5) produces the traditional 2SLS
estimator as long as a linear model is used for $P(Z_i = 1 \mid X_i)$ in the
construction of $\kappa_i$. In other words, if $P(Z_i = 1 \mid X_i) = X_i'\pi$ is
used when constructing an estimate of $\kappa_i$, the Abadie estimand
is 2SLS. Thus, we can conclude that whenever $P(Z_i = 1 \mid X_i)$
can be fit or closely approximated by a linear model, it makes
sense to view 2SLS as an approximation to the complier causal
response function, $E[Y_i \mid D_i, X_i, D_{1i} > D_{0i}]$. On the other hand,
$\alpha_c$ is not, in general, the 2SLS estimand, and $\beta_c$ is not, in
general, the vector of covariate effects produced by 2SLS. Still,
the equivalence to 2SLS for linear $P(Z_i = 1 \mid X_i)$ leads us to think
that Abadie’s method and 2SLS are likely to produce similar
estimates in most applications.
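The simplest of these claims, that the procedure collapses to the Wald estimator when the only covariate is a constant, can be verified directly. The sketch below is an illustration with assumed names and a made-up data-generating process, not Abadie's software. It estimates $P(Z_i = 1)$ by the sample mean of the instrument, builds kappa, and solves the first-order conditions of the kappa-weighted least squares problem for a regression of the outcome on treatment and a constant.

```python
import random


def simulate(n=20_000, seed=4):
    rng = random.Random(seed)
    y, d, z = [], [], []
    for _ in range(n):
        zi = 1 if rng.random() < 0.4 else 0
        u = rng.random()
        di = 1 if u < 0.2 + 0.5 * zi else 0        # monotone first stage
        y.append(u + 1.5 * di + rng.gauss(0.0, 1.0))
        d.append(di)
        z.append(zi)
    return y, d, z


def wald(y, d, z):
    n1 = sum(z)
    n0 = len(z) - n1
    ybar1 = sum(yi for yi, zi in zip(y, z) if zi) / n1
    ybar0 = sum(yi for yi, zi in zip(y, z) if not zi) / n0
    dbar1 = sum(di for di, zi in zip(d, z) if zi) / n1
    dbar0 = sum(di for di, zi in zip(d, z) if not zi) / n0
    return (ybar1 - ybar0) / (dbar1 - dbar0)


def kappa_weighted_ls(y, d, z):
    p = sum(z) / len(z)      # constant-only model for P(Z=1)
    k = [1.0 - di * (1 - zi) / (1.0 - p) - (1 - di) * zi / p
         for di, zi in zip(d, z)]
    # First-order conditions of sum_i k_i * (y_i - a*d_i - b)^2 in (a, b):
    #   a*S_kd + b*S_kd = S_kdy    (d is binary, so d*d = d)
    #   a*S_kd + b*S_k  = S_ky
    s_k = sum(k)
    s_kd = sum(ki * di for ki, di in zip(k, d))
    s_ky = sum(ki * yi for ki, yi in zip(k, y))
    s_kdy = sum(ki * di * yi for ki, di, yi in zip(k, d, y))
    det = s_kd * s_k - s_kd * s_kd
    a = (s_kdy * s_k - s_kd * s_ky) / det
    b = (s_kd * s_ky - s_kd * s_kdy) / det
    return a, b


y, d, z = simulate()
alpha_c, beta_c = kappa_weighted_ls(y, d, z)
wald_est = wald(y, d, z)
print(f"kappa-weighted slope={alpha_c:.6f} Wald={wald_est:.6f}")
```

Because some kappa values are negative, the code solves the normal equations rather than calling a weighted least squares routine that expects positive weights.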

The Angrist (2001) reanalysis of Angrist and Evans (1998) is 
an example where estimates based on (4.5.5) are indistinguish- 
able from 2SLS estimates. Using twins instruments to estimate 
the effect of a third child on female labor supply generates a 
2SLS estimate of —.088, while the corresponding Abadie esti- 
mate is —.089. Similarly, 2SLS and Abadie estimates of the 
effect on hours worked are identical at —3.55. This is not a 
strike against Abadie’s procedure. Rather, it supports
the notion that 2SLS approximates the causal relation of
interest.^31


^30 The class of approximating functions needn’t be linear. Instead of
$\alpha_cD_i + X_i'\beta_c$, it might make sense to use a nonlinear function such as an
exponential (if the dependent variable is non-negative) or probit (if the dependent
variable is zero-one). We return to this point at the end of this chapter. As
noted in section 4.4.4, the kappa-weighting scheme can be used to
characterize covariate distributions for compliers as well as to estimate outcome
distributions.


4.5.3 Average Causal Response with 
Variable Treatment Intensity* 


An important difference between the causal effects of a dummy 
variable and those of a variable that takes on the values 
{0,1,2,...} is that in the first case, there is only one causal 
effect for any one person, while in the latter there are many: 
the effect of going from 0 to 1, the effect of going from 1 to 
2, and so on. The potential outcomes notation we used for 
schooling recognizes this. Here it is again. Let 


$$Y_{si} \equiv f_i(s)$$


denote the potential (or latent) earnings that person $i$ would
receive after obtaining $s$ years of education. Note that the
function $f_i(s)$ has an “i” subscript on it but $s$ does not. The function
$f_i(s)$ tells us what $i$ would earn for any value of schooling, $s$,
and not just for the realized value, $S_i$. In other words, $f_i(s)$
answers causal “what if” questions for multinomial $S_i$.
Suppose that $S_i$ takes on values in the set $\{0, 1, \ldots, \bar{s}\}$. Then
there are $\bar{s}$ unit causal effects, $Y_{si} - Y_{s-1,i}$. A linear causal model
assumes these are the same for all $s$ and for all $i$, obviously
unrealistic assumptions. But we need not take these
assumptions literally. Rather, 2SLS provides a computational device
that generates a weighted average of unit causal effects, with
a weighting function we can estimate and study, so as to learn
where the action is coming from with a particular instrument.
This weighting function tells us how the compliers are
distributed over the range of $S_i$. It tells us, for example, that the
returns to schooling estimated using quarter of birth or
compulsory schooling laws come from shifts in the distribution of
grades completed in high school. Other instruments, such as
the distance instruments used by Card (1995), act elsewhere
on the schooling distribution and therefore capture a different
sort of return.


^31 The Abadie estimator can be computed by weighting conventional linear
or nonlinear regression software. The trick is to first construct a weighting
scheme with positive weights. This is accomplished by iterating expectations
in (4.5.5), so that $\kappa_i$ (which is negative for always-takers and never-takers)
can be replaced by the always-positive average weight,

$$E[\kappa_i \mid X_i, D_i, Y_i] = 1 - \frac{D_i(1 - E[Z_i \mid X_i, D_i, Y_i])}{1 - P(Z_i = 1 \mid X_i)} - \frac{(1 - D_i)E[Z_i \mid X_i, D_i, Y_i]}{P(Z_i = 1 \mid X_i)}.$$

(See also the discussion in section 7.2.1.) Abadie (2003) gives formulas for
standard errors and Alberto Abadie has posted software to compute them, as
well as the corresponding parameter estimates. Standard errors for the Abadie
estimator can also be estimated using a bootstrap.

To flesh this out, suppose that a single binary instrument, $Z_i$,
a dummy for having been born in a state with restrictive
compulsory attendance laws, is to be used to estimate the returns
to schooling (as in Acemoglu and Angrist, 2000). Also, let $S_{1i}$
denote the schooling $i$ would get if $Z_i = 1$, and let $S_{0i}$ denote
the schooling $i$ would get if $Z_i = 0$. The theorem below, from
Angrist and Imbens (1995), offers an interpretation of the
Wald estimand with variable treatment intensity in this case.
Note that here we combine the independence and exclusion
restrictions by simply stating that potential outcomes indexed
by $s$ are independent of the instruments.


Theorem 4.5.3 Average Causal Response. Suppose

(ACR1, Independence and Exclusion) $\{Y_{0i}, Y_{1i}, \ldots, Y_{\bar{s}i}; S_{0i}, S_{1i}\} \perp\!\!\!\perp Z_i$;

(ACR2, First Stage) $E[S_{1i} - S_{0i}] \neq 0$;

(ACR3, Monotonicity) $S_{1i} - S_{0i} \geq 0 \; \forall i$, or vice versa; assume
the first.

Then

$$\frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]} = \sum_{s=1}^{\bar{s}} \omega_s E[Y_{si} - Y_{s-1,i} \mid S_{1i} \geq s > S_{0i}],$$

where

$$\omega_s = \frac{P[S_{1i} \geq s > S_{0i}]}{\sum_{j=1}^{\bar{s}} P[S_{1i} \geq j > S_{0i}]}.$$

The weights $\omega_s$ are non-negative and sum to one.

The average causal response (ACR) theorem says that the
Wald estimator with variable treatment intensity is a weighted
average of the unit causal response along the length of the
potentially nonlinear causal relation described by $f_i(s)$. The
unit causal response, $E[Y_{si} - Y_{s-1,i} \mid S_{1i} \geq s > S_{0i}]$, is the average
difference in potential outcomes for compliers at point $s$, that
is, individuals driven by the instrument from a treatment
intensity less than $s$ to at least $s$. For example, the quarter-of-birth
instruments used by Angrist and Krueger (1991) push some
people from 11th grade to finishing 12th or higher, and others
from 10th grade to finishing 11th or higher. The Wald
estimator using quarter-of-birth instruments combines all these
effects into a single ACR.

The size of the group of compliers at point $s$ is
$P[S_{1i} \geq s > S_{0i}]$. By monotonicity, this must be non-negative and is given
by the difference in the CDF of $S_i$ at point $s$. To see this, note
that

$$P[S_{1i} \geq s > S_{0i}] = P[S_{1i} \geq s] - P[S_{0i} \geq s] = P[S_{0i} < s] - P[S_{1i} < s],$$

which is non-negative since monotonicity requires $S_{1i} \geq S_{0i}$.
Moreover,

$$P[S_{0i} < s] - P[S_{1i} < s] = P[S_i < s \mid Z_i = 0] - P[S_i < s \mid Z_i = 1]$$

by independence. Finally, note that because the mean of a
non-negative random variable is the sum (or integral) of one minus
the CDF, we have,

$$
\begin{aligned}
E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]
&= \sum_{j=1}^{\bar{s}} \left(P[S_i < j \mid Z_i = 0] - P[S_i < j \mid Z_i = 1]\right) \\
&= \sum_{j=1}^{\bar{s}} P[S_{1i} \geq j > S_{0i}].
\end{aligned}
$$

Thus, the ACR weighting function can be consistently
estimated by comparing the CDFs of the endogenous variable
(treatment intensity) with the instrument switched off and on.
The weighting function is normalized by the first stage.

The ACR theorem helps us understand what we are learning
from a 2SLS estimate. For example, instrumental variables
derived from compulsory attendance and child labor laws cap- 
ture the causal effect of increases in schooling in the 6-12 
grade range, but tell us little about the effects of postsec- 
ondary schooling. This is illustrated in figure 4.5.1, taken from 
Acemoglu and Angrist (2000). 
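The CDF comparison that defines the ACR weighting function is simple to compute. The sketch below uses simulated data rather than the census samples, and its design is an assumption made for illustration: a hypothetical law that moves some people with 8 to 11 years of schooling up by one year. It tabulates the difference in $P[S_i \geq j]$ with the instrument switched on and off at each grade $j$ and normalizes by the first stage.

```python
import random


def simulate(n=100_000, seed=5):
    rng = random.Random(seed)
    z, s = [], []
    for _ in range(n):
        zi = 1 if rng.random() < 0.5 else 0
        s0 = rng.randint(8, 16)                   # schooling without the law
        bump = 1 if (s0 <= 11 and rng.random() < 0.5) else 0
        s1 = s0 + bump                            # monotonicity: s1 >= s0
        z.append(zi)
        s.append(s1 if zi else s0)
    return z, s


def acr_weights(z, s):
    top = max(s)
    count = {0: [0] * (top + 1), 1: [0] * (top + 1)}
    for zi, si in zip(z, s):
        count[zi][si] += 1
    n = {g: sum(count[g]) for g in (0, 1)}
    diffs = {}
    for j in range(1, top + 1):
        surv1 = sum(count[1][j:]) / n[1]          # P[s >= j | z = 1]
        surv0 = sum(count[0][j:]) / n[0]          # P[s >= j | z = 0]
        diffs[j] = surv1 - surv0                  # = P[s < j | z = 0] - P[s < j | z = 1]
    first_stage = sum(diffs.values())             # = E[s | z = 1] - E[s | z = 0]
    weights = {j: v / first_stage for j, v in diffs.items()}
    return weights, first_stage


z, s = simulate()
weights, first_stage = acr_weights(z, s)
mean1 = sum(si for zi, si in zip(z, s) if zi) / sum(z)
mean0 = sum(si for zi, si in zip(z, s) if not zi) / (len(z) - sum(z))
for grade in sorted(weights):
    print(grade, round(weights[grade], 3))
print(f"first stage={first_stage:.4f} difference in means={mean1 - mean0:.4f}")
```

In this design almost all of the weight falls on grades 9 through 12. Estimated weights at grades the instrument does not affect are sampling noise around zero and can come out slightly negative even though the population weights are non-negative.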

The figure plots differences in the probability that educa- 
tional attainment is at or exceeds the grade level on the x-axis 
(i.e., one minus the CDF). The differences are between men 
exposed to different child labor laws and compulsory school- 
ing laws in the sample of white men aged 40-49 drawn from 
the 1960, 1970, and 1980 censuses. The instruments are coded 
as the number of years of schooling required either to work 
(panel A) or to leave school (panel B) in the year the respon- 
dent was age 14. Men exposed to the least restrictive laws are 
the reference group. Each instrument (e.g., a dummy for seven 
years of schooling required before work is allowed) can be 
used to construct a Wald estimator by making comparisons 
with the reference group. 

The top panel of figure 4.5.1 shows that men exposed to 
more restrictive child labor laws were one to six percentage 
points more likely to complete grades 8-12. The intensity of 
the shift depends on whether the laws required seven, eight, or 
nine-plus years of schooling before work was allowed. But in 
all cases, the CDF differences decline at lower grades, and drop 
off sharply after grade 12. The bottom panel shows a similar 
pattern for compulsory attendance laws, though the effects 
are a little smaller and the action here is at somewhat higher 
grades, consistent with the fact that compulsory attendance 
laws are typically binding in higher grades than child labor 
laws. Interestingly, the child labor and compulsory atten- 
dance instruments generate similar 2SLS estimates of about 
.08-.10. 

Before wrapping up our discussion of LATE generalizations,
it’s worth noting that most of the elements covered here work
in combination. For example, models with multiple
instruments and variable treatment intensity generate a weighted
average of the ACR for each instrument. Likewise, the saturate
and weight theorem applies to models with variable
treatment intensity (though we do not yet have an extension of
Abadie’s kappa for models with variable treatment
intensity). A final important extension covers the scenario where
the causal variable of interest is continuous and we can
therefore think of the causal response function as having
derivatives.


Figure 4.5.1 The effect of compulsory schooling instruments on
education (from Acemoglu and Angrist 2000). The figures show the
instrument-induced difference in the probability that schooling is
greater than or equal to the grade level on the x-axis. The reference
group is six or fewer years of required schooling in the top panel
and eight or fewer years in the bottom panel. The top panel shows
the CDF difference by severity of child labor laws. The bottom
panel shows the CDF difference by severity of compulsory
attendance laws.


So Long, and Thanks for All the Fish 


Suppose, as with the schooling problem, that counterfactuals 
are generated by an underlying functional relation. In this case, 
however, the causal variable of interest can take on any non- 
negative value and the functional relation is assumed to have 
a derivative. An example where this makes sense is a demand 
curve, the quantity demanded as a function of price. In
particular, let $q_i(p)$ denote the quantity demanded in market $i$ at
hypothetical price $p$. This is a potential outcome, like $f_i(s)$,
except that instead of individuals the unit of observation is a
time or a location or both. For example, Angrist, Graddy, and
Imbens (2000) estimate the elasticity of quantity demanded
at the Fulton wholesale fish market in New York City. The
slope of this demand curve is $q_i'(p)$; if quantity and price are
measured in logs, this is an elasticity.

The instruments in Angrist, Graddy, and Imbens (2000) are 
derived from data on weather conditions off the coast of Long 
Island, not too far from major commercial fishing grounds. 
Stormy weather makes it hard to catch fish, driving up the 
price and reducing the quantity demanded. Angrist, Graddy, 
and Imbens use dummy variables such as $stormy_i$, a dummy
indicating periods with high wind and waves, to estimate the 
demand for fish. The data consist of daily observations on 
wholesale purchases of whiting, a cheap fish used for fish cakes 
and things like that. 

The Wald estimator using the $stormy_i$ instrument can be
interpreted using

$$\frac{E[Q_i \mid stormy_i = 1] - E[Q_i \mid stormy_i = 0]}{E[P_i \mid stormy_i = 1] - E[P_i \mid stormy_i = 0]}
= \frac{\int E[q_i'(t) \mid P_{1i} \geq t > P_{0i}]\, P[P_{1i} \geq t > P_{0i}]\, dt}{\int P[P_{1i} \geq t > P_{0i}]\, dt}, \qquad (4.5.6)$$

where $Q_i$ and $P_i$ are the quantity and price in market (day) $i$ and
$P_{1i}$ and $P_{0i}$ are potential prices indexed by $stormy_i$. This is a
weighted average derivative with weighting function
$P[P_{1i} \geq t > P_{0i}] = P[P_i < t \mid stormy_i = 0] - P[P_i < t \mid stormy_i = 1]$
at price $t$. In other
words, IV estimation using $stormy_i$ produces an average of
the derivative $q_i'(t)$, with weight given to each possible price
(indexed by $t$) in proportion to the instrument-induced change
in the cumulative distribution function (CDF) of prices at that
point. This is the same sort of averaging as in the ACR theorem
except that now the underlying causal response is a derivative
instead of a one-unit difference.

The continuous ACR formula, (4.5.6), comes from the fact
that

$$E[Q_i \mid stormy_i = 1] - E[Q_i \mid stormy_i = 0] = E\left[\int_{P_{0i}}^{P_{1i}} q_i'(t)\, dt\right], \qquad (4.5.7)$$

by the independence assumption and the fundamental theorem
of calculus. Two interesting special cases fall neatly out of
(4.5.7). The first is when the causal response function is linear,
that is, $q_i(p) = \alpha_{0i} + \alpha_{1i}p$, for some random coefficients, $\alpha_{0i}$
and $\alpha_{1i}$. Then, we have

$$\frac{E[Q_i \mid stormy_i = 1] - E[Q_i \mid stormy_i = 0]}{E[P_i \mid stormy_i = 1] - E[P_i \mid stormy_i = 0]}
= \frac{E[\alpha_{1i}(P_{1i} - P_{0i})]}{E[P_{1i} - P_{0i}]}, \qquad (4.5.8)$$

a weighted average of the random coefficient, $\alpha_{1i}$. The weights
are proportional to the price change induced by the weather
in market $i$.
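The linear special case (4.5.8) can be illustrated with simulated markets. Everything in the sketch below is hypothetical (the slopes, the price shifts, and the variable names are assumptions, not the Fulton market data). Storms raise the price by more in markets where the demand slope is steeper, so the Wald estimate should track the shift-weighted average of the slopes rather than their simple average.

```python
import random


def simulate(n=100_000, seed=6):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        stormy = 1 if rng.random() < 0.3 else 0
        u = rng.random()
        a1 = -(0.5 + u)                     # market-specific slope alpha_1i
        a0 = 8.0 + rng.gauss(0.0, 0.5)      # market-specific intercept alpha_0i
        p0 = rng.gauss(0.0, 0.25)           # potential log price in calm weather
        p1 = p0 + u                         # potential log price in a storm
        p = p1 if stormy else p0            # observed price
        q = a0 + a1 * p                     # observed quantity, linear demand
        rows.append((stormy, p, q, a1, p1 - p0))
    return rows


rows = simulate()
storm = [r for r in rows if r[0] == 1]
calm = [r for r in rows if r[0] == 0]


def avg(values):
    return sum(values) / len(values)


# Wald estimate from observables only.
wald_est = ((avg([r[2] for r in storm]) - avg([r[2] for r in calm]))
            / (avg([r[1] for r in storm]) - avg([r[1] for r in calm])))

# Infeasible benchmarks built from the latent slopes and price shifts.
weighted_avg = sum(r[3] * r[4] for r in rows) / sum(r[4] for r in rows)
simple_avg = avg([r[3] for r in rows])

print(f"Wald={wald_est:.3f} shift-weighted={weighted_avg:.3f} simple={simple_avg:.3f}")
```

The match between the Wald estimate and the shift-weighted average is approximate, since the Wald estimate is subject to sampling error.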

The second special case is when we can write quantity
demanded as

$$q_i(p) \equiv Q(p) + \eta_i, \qquad (4.5.9)$$

where $Q(p)$ is a nonstochastic function and $\eta_i$ is an additive
random error. By this we mean $q_i'(p) = Q'(p)$ every day or in
every market. In this case, the average causal response function
becomes

$$\int Q'(t)\omega(t)\, dt, \quad \text{where } \omega(t) \equiv \frac{P[P_{1i} \geq t > P_{0i}]}{\int P[P_{1i} \geq r > P_{0i}]\, dr},$$

where $r$ is the integrating variable in the denominator.

These special cases highlight the two types of averaging 
wrapped up in the ACR theorem and its continuous corollary, 
(4.5.6). First, there is averaging across markets, with weights 
proportional to the first-stage impact on prices in each market. 
Markets where prices are highly sensitive to the weather con- 
tribute the most. Second, there is averaging along the length 
of the causal response function in a given market. IV recov- 
ers the average derivative over the range of prices where the 
instruments shift the CDF of prices most sharply. 


4.6 IV Details 


4.6.1 2SLS Mistakes 


2SLS estimates are easy to compute, especially since software 
packages like SAS and Stata will do it for you. Occasionally, 
however, you might be tempted to do it yourself just to see if 
it really works. Or you may be stranded on the planet Krikkit 
with all of your software licenses expired (Krikkit is encased 
in a slo-time envelope, so it will take you a long time to get 
licenses renewed). Manual 2SLS is for just such emergencies. 
In the manual 2SLS procedure, you estimate the first stage 
yourself (which in any case you should be looking at) and plug 
the fitted values into the second-stage equation, which is then 
estimated by OLS. Returning to the system at the beginning of 
this chapter, the first and second stages are 


$$S_i = X_i'\pi_{10} + \pi_{11}'Z_i + \xi_{1i}$$
$$Y_i = \alpha'X_i + \rho\hat{S}_i + [\eta_i + \rho(S_i - \hat{S}_i)],$$


where $X_i$ is a set of covariates, $Z_i$ is a set of excluded
instruments, and the first-stage fitted values are
$\hat{S}_i = X_i'\hat{\pi}_{10} + \hat{\pi}_{11}'Z_i$.


Manual 2SLS takes some of the mystery out of canned 2SLS 
and may be useful in a software crisis, but it opens the door 
to mistakes. For one thing, as we discussed earlier, the OLS 
standard errors from the manual second stage will not be
correct (the OLS residual variance is the variance of $\eta_i + \rho(S_i - \hat{S}_i)$,
while for proper 2SLS standard errors you want the variance
of $\eta_i$ only). There are more subtle risks as well.
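The standard error problem is easy to see in a small simulation. The sketch below is illustrative only: it uses a single instrument and no covariates other than a constant, and the data-generating process and names are assumptions. It runs manual 2SLS, confirms that the point estimate matches the usual IV formula, and then compares the residual variance the manual second stage reports with the one computed from the actual regressor.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (w - mb) for x, w in zip(a, b)) / len(a)


def simulate(n=20_000, seed=7, rho=1.0):
    rng = random.Random(seed)
    y, s, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)               # first-stage error
        eta = 0.5 * v + rng.gauss(0.0, 1.0)   # structural error, correlated with v
        si = zi + v
        y.append(rho * si + eta)
        s.append(si)
        z.append(zi)
    return y, s, z


y, s, z = simulate()

# First stage by OLS, then plug the fitted values into the second stage.
pi1 = cov(s, z) / cov(z, z)
pi0 = mean(s) - pi1 * mean(z)
s_hat = [pi0 + pi1 * zi for zi in z]
rho_manual = cov(y, s_hat) / cov(s_hat, s_hat)
alpha = mean(y) - rho_manual * mean(s_hat)
rho_iv = cov(y, z) / cov(s, z)                # just-identified IV formula

# Residual variance as the manual second stage sees it (built from s_hat) ...
var_manual = mean([(yi - alpha - rho_manual * sh) ** 2 for yi, sh in zip(y, s_hat)])
# ... versus the one proper 2SLS standard errors use (same coefficients, actual s).
var_2sls = mean([(yi - alpha - rho_manual * si) ** 2 for yi, si in zip(y, s)])

print(f"manual 2SLS={rho_manual:.4f} IV={rho_iv:.4f}")
print(f"residual variance: manual={var_manual:.3f} proper={var_2sls:.3f}")
```

In this design the manual residual variance is far too large. With a different correlation between the two errors it could just as well be too small, so the manual standard errors are wrong rather than conservative.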


Covariate Ambivalence 


Suppose the covariate vector contains two sorts of variables,
some (say, $X_{0i}$) that you are comfortable with, and others
(say, $X_{1i}$) about which you are ambivalent. Griliches and
Mason (1972) faced this scenario when constructing 2SLS
estimates of a wage equation that treats AFQT scores (an
ability test used by the armed forces) as an endogenous
variable to be instrumented. The instruments for AFQT are early
schooling (completed before military service), race, and
family background variables. They estimated a system that can be
described like this:


$$S_i = X_{0i}'\pi_{10} + \pi_{11}'Z_i + \xi_{1i}$$
$$Y_i = \alpha_0'X_{0i} + \alpha_1'X_{1i} + \rho\hat{S}_i + [\eta_i + \rho(S_i - \hat{S}_i)].$$

This looks a lot like manual 2SLS. 

A closer look, however, reveals an important difference 
between the equations above and the usual 2SLS procedure: 
the covariates in the first and second stages are not the same. 
For example, Griliches and Mason included age in the sec- 
ond stage but not in the first, a fact noted by Cardell and 
Hopkins (1977) in a comment on their paper. This was a mis- 
take. Griliches and Mason’s second-stage estimates are not 
the same as 2SLS. What’s worse, they are inconsistent where 
2SLS might have been fine. To see why, note that the first-stage
residual, $S_i - \hat{S}_i$, is uncorrelated with $X_{0i}$ by construction, since
OLS residuals are always uncorrelated with included
regressors. But because $X_{1i}$ is not included in the first stage it is
likely to be correlated with the first-stage residuals (e.g., age is
probably correlated with the AFQT residual from the Griliches
and Mason (1972) first stage). The inconsistency from this
correlation spills over to all coefficients in the second stage.
The moral of the story: Put the same exogenous covariates in
your first and second stage. If a covariate is good enough for
the second stage, it’s good enough for the first.
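A simulation shows how much damage mismatched covariates can do. The sketch below is a hypothetical design, not the Griliches and Mason data; the coefficients and names are assumptions. The covariate `x1` plays the role of the variable included in the second stage only, and it is correlated with the instrument. The true coefficient on the endogenous variable is 1, and in this design the mismatched procedure converges to 1/1.4 instead.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (w - mb) for x, w in zip(a, b)) / len(a)


def ols2(y, a, b):
    """Slopes from OLS of y on (1, a, b), via the 2x2 normal equations."""
    saa, sbb, sab = cov(a, a), cov(b, b), cov(a, b)
    cya, cyb = cov(y, a), cov(y, b)
    det = saa * sbb - sab * sab
    return (cya * sbb - cyb * sab) / det, (cyb * saa - cya * sab) / det


def simulate(n=50_000, seed=8, rho=1.0):
    rng = random.Random(seed)
    y, s, z, x1 = [], [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)               # the "ambivalent" covariate
        zi = 0.5 * xi + rng.gauss(0.0, 1.0)    # instrument correlated with it
        v = rng.gauss(0.0, 1.0)
        eta = 0.5 * v + rng.gauss(0.0, 1.0)    # endogeneity through v
        si = zi + xi + v
        y.append(rho * si + 0.3 * xi + eta)
        s.append(si)
        z.append(zi)
        x1.append(xi)
    return y, s, z, x1


y, s, z, x1 = simulate()

# Proper manual 2SLS: x1 appears in both stages.
pz, px = ols2(s, z, x1)
s_hat = [pz * zi + px * xi for zi, xi in zip(z, x1)]
rho_same, _ = ols2(y, s_hat, x1)

# Mismatched stages: x1 is left out of the first stage only.
pz_only = cov(s, z) / cov(z, z)
s_hat_bad = [pz_only * zi for zi in z]
rho_mismatch, _ = ols2(y, s_hat_bad, x1)

print(f"same covariates in both stages: {rho_same:.3f}")
print(f"covariate missing from first stage: {rho_mismatch:.3f}")
```

If the instrument were uncorrelated with `x1`, the coefficient on the fitted values would survive and only the coefficient on `x1` would be off, which is why the design makes the two correlated.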


Forbidden Regressions 


Forbidden regressions were forbidden by MIT professor Jerry 
Hausman in 1975, and while they occasionally resurface in 
an undersupervised thesis, they are still technically off-limits. 
A forbidden regression crops up when researchers apply 2SLS 
reasoning directly to nonlinear models. A common scenario 
is a dummy endogenous variable. Suppose, for example, that 
the causal model of interest is 


$$Y_i = \alpha'X_i + \rho D_i + \eta_i, \qquad (4.6.1)$$


where $D_i$ is a dummy variable for veteran status. The usual
2SLS first stage is


$$D_i = \pi_0'X_i + \pi_1'Z_i + \xi_{1i}, \qquad (4.6.2)$$


a linear regression of $D_i$ on covariates and a vector of
instruments, $Z_i$.

Because $D_i$ is a dummy variable, the CEF associated with
this first stage, $E[D_i \mid X_i, Z_i]$, is probably nonlinear. So the
usual OLS first stage is an approximation to the underlying
nonlinear CEF. We might, therefore, use a nonlinear first
stage in an attempt to come closer to the CEF. Suppose that
we use probit to model $E[D_i \mid X_i, Z_i]$. The probit first stage is
$\Phi[\pi_{p0}'X_i + \pi_{p1}'Z_i]$, where $\pi_{p0}$ and $\pi_{p1}$ are probit coefficients
and the fitted values are $\hat{D}_{pi} = \Phi[\hat{\pi}_{p0}'X_i + \hat{\pi}_{p1}'Z_i]$. The
forbidden regression in this case is the second-stage equation created
by substituting $\hat{D}_{pi}$ for $D_i$:


$$Y_i = \alpha'X_i + \rho\hat{D}_{pi} + [\eta_i + \rho(D_i - \hat{D}_{pi})]. \qquad (4.6.3)$$


The problem with (4.6.3) is that only OLS estimation of (4.6.2)
is guaranteed to produce first-stage residuals that are
uncorrelated with fitted values and covariates. If
$E[D_i \mid X_i, Z_i] = \Phi[\pi_{p0}'X_i + \pi_{p1}'Z_i]$, then residuals from the nonlinear model
will be asymptotically uncorrelated with $X_i$ and $\hat{D}_{pi}$, but who is
to say that the first-stage CEF is really probit? In contrast, with
garden variety 2SLS, we do not need to worry about whether
the first-stage CEF is really linear.^32

A simple alternative to the forbidden second step, (4.6.3), 
avoids problems due to an incorrect nonlinear first stage. 
Instead of plugging in nonlinear fitted values, we can use the 
nonlinear fitted values as instruments. In other words, use $\hat{D}_{pi}$
as an instrument for $D_i$ in (4.6.1) in a conventional 2SLS
procedure (as always, the exogenous covariates, $X_i$, should also
be in the instrument list). Use of fitted values as instruments is 
the same as plugging in fitted values when the first stage is 
estimated by OLS, but not in general. Using nonlinear fits as 
instruments has the further advantage that, if the nonlinear 
model gives a better approximation to the first-stage CEF than 
the linear model, the resulting 2SLS estimates will be more 
efficient than those using a linear first stage (Newey, 1990). 
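The difference between plugging in a nonlinear fit and using it as an instrument can be seen in a stylized simulation. The sketch below makes several assumptions that are not in the text: the only covariate is a constant, the true first-stage CEF is $\Phi(z)$, and the "nonlinear first stage" is a deliberately misspecified stand-in, $\Phi(3z)$, rather than an estimated probit. A fitted probit would usually sit much closer to the truth, so the gap shown here exaggerates what one would expect in practice.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (w - mb) for x, w in zip(a, b)) / len(a)


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def simulate(n=100_000, seed=9, rho=2.0):
    rng = random.Random(seed)
    y, d, z = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        di = 1.0 if zi + v > 0 else 0.0        # true first-stage CEF is Phi(z)
        eta = 0.8 * v + rng.gauss(0.0, 1.0)    # endogeneity: eta correlated with v
        y.append(rho * di + eta)
        d.append(di)
        z.append(zi)
    return y, d, z


y, d, z = simulate()

# (a) OLS first stage: plugging in and instrumenting give the same number.
pi1 = cov(d, z) / cov(z, z)
d_ols = [pi1 * zi for zi in z]                  # intercept omitted, covariances unaffected
plug_ols = cov(y, d_ols) / cov(d_ols, d_ols)
iv_ols = cov(y, d_ols) / cov(d, d_ols)

# (b) Misspecified nonlinear fit: plug-in (forbidden) versus use as an instrument.
d_nl = [normal_cdf(3.0 * zi) for zi in z]
plug_nl = cov(y, d_nl) / cov(d_nl, d_nl)        # forbidden regression
iv_nl = cov(y, d_nl) / cov(d, d_nl)             # nonlinear fit as instrument

print(f"OLS fit: plug-in={plug_ols:.4f} instrument={iv_ols:.4f}")
print(f"nonlinear fit: plug-in={plug_nl:.3f} instrument={iv_nl:.3f} (true rho=2)")
```

The instrument version stays near the true coefficient because any function of the instrument is a valid instrument here, whereas the plug-in version inherits the misspecification of the first stage.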

But here, too, there is a drawback. The procedure using
nonlinear fitted values as instruments implicitly uses nonlinearities
in the first stage as a source of identifying information.
To see this, suppose the causal model of interest includes the
vector of instruments, $Z_i$:


$$Y_i = \alpha'X_i + \gamma'Z_i + \rho D_i + \eta_i. \tag{4.6.4}$$


Now, with the first stage given by (4.6.2), the model is unidentified,
and conventional 2SLS estimates of (4.6.4) don’t exist.
In fact, (4.6.4) violates the exclusion restriction. But 2SLS estimates
using $X_i$, $Z_i$, and $\hat{D}_{pi}$ as instruments do exist, because
$\hat{D}_{pi}$ is a nonlinear function of $X_i$ and $Z_i$ that is excluded from
the second stage. Should you use this nonlinearity as a source
of identifying information? We usually prefer to avoid this
sort of back-door identification since it’s not clear what the
underlying experiment really is.


32 The insight that consistency of 2SLS estimates in a traditional SEM does
not depend on correct specification of the first-stage CEF goes back to Kelejian 
(1971). Use of a nonlinear plug-in first stage may not do too much damage 
in practice—a probit first stage can be pretty close to linear—but why take a 
chance when you don’t have to?

As a rule, naively plugging in first-stage fitted values in
nonlinear models is a bad idea. This includes models with 
a nonlinear second stage as well as those where the CEF for 
the first stage is nonlinear. Suppose, for example, that you 
believe the causal relation between schooling and earnings is 
approximately quadratic but otherwise homogeneous (as in 
Card’s (1995) structural model). In other words, the model of 
interest is 


$$Y_i = \alpha'X_i + \rho_1 s_i + \rho_2 s_i^2 + \eta_i. \tag{4.6.5}$$


Given two instruments, it’s easy enough to estimate (4.6.5),
treating both $s_i$ and $s_i^2$ as endogenous. In this case, there are
two first-stage equations, one for $s_i$ and one for $s_i^2$. Although
you need at least two instruments for this to work, it’s natural 
to use the original instrument and its square (unless the only 
instrument is a dummy, in which case you'll need a better 
idea). 

You might be tempted, however, to work with a single first 
stage, say equation (4.6.2), and estimate the following second 
stage manually: 


$$Y_i = \alpha'X_i + \rho_1\hat{s}_i + \rho_2\hat{s}_i^2 + [\eta_i + \rho_1(s_i - \hat{s}_i) + \rho_2(s_i^2 - \hat{s}_i^2)].$$


This is a mistake, since $\hat{s}_i$ can be correlated with $s_i^2 - \hat{s}_i^2$, while
$\hat{s}_i^2$ can be correlated with both $s_i - \hat{s}_i$ and $s_i^2 - \hat{s}_i^2$. In contrast,
as long as $X_i$ and $Z_i$ are uncorrelated with $\eta_i$ in (4.6.5) and
you have enough instruments in $Z_i$, 2SLS estimation of (4.6.5)
is straightforward.


4.6.2 Peer Effects 


A vast literature in social science is concerned with peer effects. 
Loosely speaking, this means the causal effect of group charac- 
teristics on individual outcomes. Sometimes regression is used 
in an attempt to uncover these effects. In practice, the use of 
regression models to estimate peer effects is fraught with peril. 
Although this is not really an IV issue per se, the language and 
algebra of 2SLS help us understand why peer effects are hard 
to identify.

Broadly speaking, there are two types of peer effects. The
first concerns the effect of group characteristics such as the 
average schooling in a state or city on individual outcomes as 
described by another variable. For example, Acemoglu and 
Angrist (2000) ask whether a given individual’s earnings are 
affected by the average schooling in his or her state of resi- 
dence. The theory of human capital externalities suggests that 
living in a state with a more educated workforce may make 
everyone in the state more productive, not just those who are 
more educated. This kind of spillover is said to be a social 
return to schooling: human capital that benefits everyone, 
whether or not they are more educated. 

A causal model that allows for such externalities can be 
written 


$$Y_{ijt} = \mu_j + \lambda_t + \gamma\bar{S}_{jt} + \rho s_i + u_{jt} + \eta_{ijt}, \tag{4.6.6}$$


where $Y_{ijt}$ is the log weekly wage of individual $i$ in state $j$ in year
$t$, $u_{jt}$ is a state-year error component, and $\eta_{ijt}$ is an individual
error term. The controls $\mu_j$ and $\lambda_t$ are state-of-residence and
year effects. The coefficient $\rho$ is the returns to schooling for
an individual, while the coefficient $\gamma$ is meant to capture the
effect of average schooling, $\bar{S}_{jt}$, in state $j$ and year $t$.

In addition to the usual concerns about $s_i$, the most important
identification problem raised by equation (4.6.6) is omitted
variables bias from correlation between average schooling
and other state-year effects embodied in the error component
$u_{jt}$. For example, public university systems may expand during
cyclical upturns, generating a common trend in state average
schooling levels and state average earnings. Acemoglu and
Angrist (2000) attempt to solve this problem using instrumental
variables derived from historical compulsory attendance
laws that are correlated with $\bar{S}_{jt}$ but uncorrelated with
contemporaneous $u_{jt}$ and $\eta_{ijt}$.

While omitted state-year effects are the primary concern
motivating Acemoglu and Angrist’s (2000) IV estimation, the
fact that one regressor, $\bar{S}_{jt}$, is the average of another regressor,
$s_i$, also complicates the interpretation of OLS estimates of
equation (4.6.6). To see this, consider a simpler version
of (4.6.6) with a cross-section dimension only. This can be
written


$$Y_{ij} = \mu + \pi_0 s_i + \pi_1\bar{S}_j + v_{ij}, \tag{4.6.7}$$


where $Y_{ij}$ is the log weekly wage of individual $i$ in state $j$ and
$\bar{S}_j$ is average schooling in the state. The coefficients $\pi_0$ and
$\pi_1$ are defined so that the error, $v_{ij}$, is uncorrelated with both
regressors. Now, let $\rho_0$ denote the coefficient from a bivariate
regression of $Y_{ij}$ on $s_i$ only and let $\rho_1$ denote the coefficient
from a bivariate regression of $Y_{ij}$ on $\bar{S}_j$ only. From the discussion
of grouping and 2SLS earlier in this chapter, we know that
$\rho_1$ is the 2SLS estimate of the coefficient on $s_i$ in a bivariate
regression of $Y_{ij}$ on $s_i$ using a full set of state dummies as instruments.
The appendix uses this fact to show that the parameters
in equation (4.6.7) can be written in terms of $\rho_0$ and $\rho_1$ as


$$\pi_0 = \rho_1 + \phi(\rho_0 - \rho_1) \tag{4.6.8}$$
$$\pi_1 = \phi(\rho_1 - \rho_0),$$


where $\phi = \frac{1}{1 - R^2} > 1$, and $R^2$ is the first-stage $R^2$ when state
dummies are used as instruments for $s_i$.

The upshot of (4.6.8) is that if, for any reason, OLS
estimates of the bivariate regression of wages on individual
schooling differ from 2SLS estimates using state dummy instruments,
the coefficient on average schooling in (4.6.7) will be
nonzero. For example, if instrumenting with state dummies
corrects for attenuation bias due to measurement error in
$s_i$, we have $\rho_1 > \rho_0$ and the spurious appearance of positive
social returns. In contrast, if instrumenting with state dummies
eliminates the bias from positive correlation between $s_i$
and unobserved earnings potential, we have $\rho_1 < \rho_0$, and the
appearance of negative social returns.33 In practice, therefore,
it is very difficult to isolate social effects by OLS estimation of
an equation like (4.6.6), though more sophisticated IV strategies
where both the individual and group averages are treated
as endogenous may work.


33 The coefficient on average schooling in an equation with individual
schooling can be interpreted as the Hausman (1978) test statistic for the
equality of OLS estimates and 2SLS estimates of private returns to schooling
using state dummies as instruments. Borjas (1992) discusses a similar problem
affecting the estimation of ethnic-background effects.
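The algebra in (4.6.8) holds exactly in any sample, which makes it easy to check numerically. The sketch below uses a made-up data-generating process of our own (40 hypothetical states, 50 workers each, an unobserved "ability" term that raises both schooling and wages, and no social return at all) and verifies that the coefficients from the regression on individual and average schooling match the formulas built from the two bivariate regressions.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divisor n; it cancels in every ratio used here)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

random.seed(1)
n_states, n_per = 40, 50
state, s, y = [], [], []
for j in range(n_states):
    level = random.gauss(12, 1)                  # state-specific schooling level
    for _ in range(n_per):
        ability = random.gauss(0, 1)             # unobserved earnings potential
        s_i = level + 2 * ability + random.gauss(0, 1)
        # Private return of 0.08; average schooling has NO effect in this DGP.
        y_i = 0.08 * s_i + 0.2 * ability + random.gauss(0, 0.3)
        state.append(j)
        s.append(s_i)
        y.append(y_i)

# Average schooling in each person's state.
s_bar_by_state = {j: mean([si for si, k in zip(s, state) if k == j])
                  for j in range(n_states)}
s_bar = [s_bar_by_state[j] for j in state]

rho0 = cov(y, s) / cov(s, s)               # bivariate OLS on individual schooling
rho1 = cov(y, s_bar) / cov(s_bar, s_bar)   # OLS on state means = 2SLS with state dummies
r2 = cov(s_bar, s_bar) / cov(s, s)         # first-stage R^2 of s on state dummies
phi = 1.0 / (1.0 - r2)

# Regression of y on s and s_bar (plus a constant) via the 2x2 normal equations.
a11, a12, a22 = cov(s, s), cov(s, s_bar), cov(s_bar, s_bar)
b1, b2 = cov(s, y), cov(s_bar, y)
det = a11 * a22 - a12 * a12
pi0 = (a22 * b1 - a12 * b2) / det
pi1 = (a11 * b2 - a12 * b1) / det

print("pi0:", pi0, "formula:", rho1 + phi * (rho0 - rho1))
print("pi1:", pi1, "formula:", phi * (rho1 - rho0))
```

Because ability bias pushes the OLS coefficient above the grouped estimate in this design, the coefficient on average schooling comes out negative even though average schooling plays no causal role.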

A second and even more difficult peer effect to uncover is 
the effect of the group average of a variable on the individual 
level of this same variable. This is not really an IV problem; 
it takes us back to basic regression issues. To see this point, 
suppose that $\bar{S}_j$ is the high school graduation rate in school $j$,
and we would like to know whether students are more likely
to graduate from high school when everyone around them is
more likely to graduate from high school. To uncover the peer
effect on high school graduation rates, we might work with a
regression model like:


$$s_{ij} = \mu + \pi_2\bar{S}_j + v_{ij}, \tag{4.6.9}$$


where $s_{ij}$ is individual $i$’s high school graduation status and $\bar{S}_j$
is the average high school graduation rate in school $j$, which $i$
attends.

At first blush, equation (4.6.9) seems like a sensible formulation
of a well-defined causal question, but in fact it is nonsense.
The regression of $s_{ij}$ on $\bar{S}_j$ always has a coefficient of 1, a
conclusion that can be drawn immediately once you recognize $\bar{S}_j$ as
the first-stage fitted value from a regression of $s_{ij}$ on a full set
of school dummies.34 Thus, an equation like (4.6.9) cannot
possibly be informative about causal effects. A modestly
improved version of this bad peer regression changes (4.6.9) to


$$s_{ij} = \mu + \pi_3\bar{S}_{(i)j} + v_{ij}, \tag{4.6.10}$$


where $\bar{S}_{(i)j}$ is the mean of $s_{ij}$ in school $j$, excluding student $i$.
This is a step in the right direction, since $\pi_3$ is no longer
automatically equal to 1, but it is still problematic because $s_{ij}$ and
$\bar{S}_{(i)j}$ are both affected by school-level random shocks that are
implicitly part of $v_{ij}$. The presence of random group effects
in the error term raises important issues for statistical inference,
issues discussed at length in chapter 8. But in an equation
like (4.6.10), group-level random shocks are more than a problem
for standard errors: any shock common to the group
(school) creates spurious peer effects. For example, particularly
effective school principals may raise graduation rates for
everyone in the schools at which they work. This looks like
a peer effect, since it induces correlation between $s_{ij}$ and $\bar{S}_{(i)j}$
even if there is no causal link between peer means and individual
student achievement. We therefore prefer not to see
regressions like (4.6.10) either.


34 Here is a direct proof that the coefficient from a regression of $s_{ij}$ on $\bar{S}_j$
is always unity:

$$\frac{\sum_j\sum_i s_{ij}(\bar{S}_j - \bar{S})}{\sum_j n_j(\bar{S}_j - \bar{S})^2} = \frac{\sum_j(\bar{S}_j - \bar{S})(n_j\bar{S}_j)}{\sum_j n_j(\bar{S}_j - \bar{S})^2} = 1,$$

where $n_j$ is the number of students in school $j$ and $\bar{S}$ is the overall mean;
the last step uses $\sum_j n_j(\bar{S}_j - \bar{S}) = 0$.
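Both problems show up in a few lines of simulation. In the sketch below, the number of schools, the class size, and the distribution of the school-level graduation probability `p_j` are hypothetical choices of ours. The key feature is that students graduate independently given `p_j`, so there is no peer effect in the data-generating process, only a common school shock.

```python
import random

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Bivariate OLS slope from a regression of y on x (with a constant)."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

random.seed(2)
n_schools, n_students = 200, 30
s, s_bar, s_loo = [], [], []
for j in range(n_schools):
    p_j = random.uniform(0.5, 0.9)   # common school shock, e.g. an effective principal
    grads = [1.0 if random.random() < p_j else 0.0 for _ in range(n_students)]
    total = sum(grads)
    for g in grads:
        s.append(g)
        s_bar.append(total / n_students)               # school mean, including student i
        s_loo.append((total - g) / (n_students - 1))   # mean excluding student i

print("coefficient on school mean:       ", slope(s_bar, s))
print("coefficient on leave-one-out mean:", slope(s_loo, s))
```

The first coefficient equals 1 up to rounding, whatever the data. The second is not mechanically 1, but it is clearly positive here even though no student affects any other, which is the spurious peer effect created by common shocks.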

The best shot at a causal investigation of peer effects focuses 
on variation in ex ante peer characteristics, that is, some mea- 
sure of peer quality that predates the outcome variable and 
is therefore unaffected by common shocks. A recent example 
is Ammermueller and Pischke (2006), who studied the link 
between classmates’ family background, as measured by the 
number of books in their homes, and student achievement in 
European primary schools. The Ammermueller and Pischke 
regressions are versions of 


$$s_{ij} = \mu + \pi_4\bar{B}_{(i)j} + v_{ij},$$


where $\bar{B}_{(i)j}$ is the average number of books in the home of student
$i$’s peers. This looks like (4.6.10), but with an important
difference. The variable $\bar{B}_{(i)j}$ is a feature of the home environment
that predates test scores and is therefore unaffected by
school-level random shocks.

Angrist and Lang (2004) provide another example of 
an attempt to link student achievement with the ex ante 
characteristics of peers. The Angrist and Lang study looked 
at the impact of bused-in low-achieving newcomers on high-achieving
residents’ test scores. The regression of interest in
this case is a version of


$$s_{ij} = \mu + \pi_5 m_j + v_{ij}, \tag{4.6.11}$$


where $m_j$ is the number of bused-in low achievers in school $j$
and $s_{ij}$ is resident student $i$’s test score. Spurious correlation
due to common shocks is not a concern in this context, for two
reasons. First, $m_j$ is a feature of the school population determined
by students outside the sample used to estimate (4.6.11).
Second, the number of low achievers is an ex ante variable
based on information about where the students come from
and not the outcome variable, $s_{ij}$. School-level random effects
that are part of $v_{ij}$ remain an important issue for inference,
however, since $m_j$ is a group-level variable.


4.6.3 Limited Dependent Variables Reprise 


In section 3.4.2, we discussed the consequences of limited 
dependent variables for regression models. When the depen- 
dent variable is binary or non-negative—say, employment 
status or hours worked—the CEF is typically nonlinear. Most 
nonlinear LDV models are built around a nonlinear transfor- 
mation of a linear latent index. Examples include probit, logit, 
and Tobit. These models capture features of the associated 
CEFs (e.g., probit fitted values are guaranteed to be between 
zero and one, while Tobit fitted values are non-negative). Yet 
we saw that the added complexity and extra work required 
to interpret the results from latent index models may not be 
worth the trouble. 

An important consideration in favor of OLS is a conceptual 
robustness that structural models often lack. OLS always gives 
a MMSE linear approximation to the CEF. In fact, we can 
think of OLS as a scheme for computing marginal effects— 
a scheme that has the virtue of simplicity, automation, and 
comparability across studies. Nonlinear latent index models 
are more like GLS: they provide an efficiency gain when taken 
literally, but require a commitment to functional form and 
distributional assumptions, about which we do not usually feel
strongly.35 A second consideration is the distinction between
the latent index parameters at the heart of nonlinear models 
and the average causal effects that we believe should be the 
objects of primary interest in most research projects. 

The arguments in favor of conventional OLS with LDVs 
apply with equal force to 2SLS and models with endogenous 
variables. IV methods capture local average treatment effects 
regardless of whether the dependent variable is binary, non- 
negative, or continuously distributed. With covariates, we can 
think of 2SLS as estimating LATE averaged across covari- 
ate cells. In models with variable or continuous treatment 
intensity, 2SLS gives us the average causal response or an aver- 
age derivative. Although Abadie (2003) has shown that 2SLS 
does not, in general, provide the MMSE approximation to 
the complier causal response function, in practice, 2SLS esti- 
mates come out remarkably close to estimates using the more 
rigorously grounded Abadie procedure (and with a saturated 
model for covariates, 2SLS and Abadie are the same). More- 
over, 2SLS estimates LATE directly; there is no intermediate 
step involving the calculation of marginal effects. 

35 The analogy between nonlinear LDV models and GLS is more than rhetorical.
Consider a probit model with nonlinear CEF,
$E[Y_i \mid X_i] = \Phi\left[\frac{X_i'\beta^*}{\sigma}\right] \equiv r_i$.
The first-order conditions for maximum likelihood estimation of this model
are

$$\sum_i \frac{(Y_i - r_i)\,\phi_i\,X_i}{r_i(1 - r_i)} = 0,$$

where $\phi_i$ is the standard normal density evaluated at $X_i'\beta^*/\sigma$.
Maximum likelihood is asymptotically the same as GLS estimation of the
nonlinear regression model

$$Y_i = \Phi\left[\frac{X_i'\beta^*}{\sigma}\right] + \xi_i,$$

since the conditional variance of $Y_i$ is $r_i(1 - r_i)$. The only difference is that
GLS is done in two steps.

2SLS is not the only way to go. An alternative, more elaborate
approach tries to build up a causal story by describing the
process generating LDVs in detail. A good example is bivariate
probit, which can be applied to the Angrist and Evans
(1998) example like this. Suppose that a woman decides to
have a third child by comparing costs and benefits using a
net benefit function or latent index that is linear in covariates
and excluded instruments, with a random error term, $v_i$. The
bivariate probit first stage can be written


$$D_i = 1[X_i'\gamma_0^* + \gamma_1^* Z_i > v_i], \tag{4.6.12}$$


where $Z_i$ is an instrumental variable that increases the benefit
of a third child, conditional on covariates, $X_i$. For example,
American parents appear to value a third child more when they 
have had either two boys or two girls, a sort of portfolio diver- 
sification phenomenon that can be understood as increasing 
the benefit of a third child in families with same-sex sibships. 

An outcome variable of primary interest in this context is 
employment status, a Bernoulli random variable with a con- 
ditional mean between zero and one. To complete the model, 
suppose that employment status, $Y_i$, is determined by the latent
index


$$Y_i = 1[X_i'\beta_0^* + \beta_1^* D_i > \varepsilon_i], \tag{4.6.13}$$


where $\varepsilon_i$ is a second random component or error term. This
latent index can be seen as arising from a comparison of the
costs and benefits of working.

The source of omitted variables bias in the bivariate probit
setup is correlation between $v_i$ and $\varepsilon_i$. In other words,
unmeasured random determinants of childbearing are correlated
with unmeasured random determinants of employment.
The model is identified by assuming $Z_i$ is independent of
these components, and that the random components are normally
distributed. Given normality, the parameters in (4.6.12)
and (4.6.13) can be estimated by maximum likelihood. The
log likelihood function is


$$\sum_i Y_i \ln\Phi_b\left(\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\varepsilon}, \frac{X_i'\gamma_0^* + \gamma_1^* Z_i}{\sigma_v}; \rho_{\varepsilon v}\right) + (1 - Y_i)\ln\left[1 - \Phi_b\left(\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\varepsilon}, \frac{X_i'\gamma_0^* + \gamma_1^* Z_i}{\sigma_v}; \rho_{\varepsilon v}\right)\right], \tag{4.6.14}$$


where $\Phi_b(\cdot,\cdot;\rho_{\varepsilon v})$ is the bivariate normal distribution function
with correlation coefficient $\rho_{\varepsilon v}$. Note, however, that we
can multiply the latent index coefficients and error standard
deviations $(\sigma_\varepsilon, \sigma_v)$ by a positive constant without changing
the likelihood. The object of estimation is therefore the ratio
of the index coefficients to the error standard deviations
(e.g., $\beta_1^*/\sigma_\varepsilon$).

The potential outcomes defined by the bivariate probit 
model are 


$$Y_{0i} = 1[X_i'\beta_0^* > \varepsilon_i] \quad\text{and}\quad Y_{1i} = 1[X_i'\beta_0^* + \beta_1^* > \varepsilon_i],$$

while potential treatment assignments are

$$D_{0i} = 1[X_i'\gamma_0^* > v_i] \quad\text{and}\quad D_{1i} = 1[X_i'\gamma_0^* + \gamma_1^* > v_i].$$


As usual, only one potential outcome and one potential assignment
are observed for any one person. It’s also clear from this
representation that correlation between $v_i$ and $\varepsilon_i$ is the same
thing as correlation between potential treatment assignments
and potential outcomes.

The latent index coefficients do not themselves tell us any- 
thing about the size of the causal effect of childbearing on 
employment other than the sign. To see this, note that the 
average causal effect of childbearing is 


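$$E[Y_{1i} - Y_{0i}] = E\{1[X_i'\beta_0^* + \beta_1^* > \varepsilon_i] - 1[X_i'\beta_0^* > \varepsilon_i]\},$$

while the average effect on the treated is

$$E[Y_{1i} - Y_{0i} \mid D_i = 1] = E\{1[X_i'\beta_0^* + \beta_1^* > \varepsilon_i] - 1[X_i'\beta_0^* > \varepsilon_i] \mid X_i'\gamma_0^* + \gamma_1^* Z_i > v_i\}.$$


Given alternative distributional assumptions for $v_i$ and $\varepsilon_i$,
these can be anything. (If the error terms are heteroskedastic,
then even the sign of these expressions is indeterminate.)

By virtue of normality, the average causal effects generated
by the bivariate probit model are easy to evaluate. The average
treatment effect is


$$E\{1[X_i'\beta_0^* + \beta_1^* > \varepsilon_i] - 1[X_i'\beta_0^* > \varepsilon_i]\} = E\left\{\Phi\left[\frac{X_i'\beta_0^* + \beta_1^*}{\sigma_\varepsilon}\right] - \Phi\left[\frac{X_i'\beta_0^*}{\sigma_\varepsilon}\right]\right\}, \tag{4.6.15}$$


where $\Phi[\cdot]$ is the normal CDF. The effect on the treated
is a little more complicated since it involves the bivariate
normal CDF:


$$E[Y_{1i} - Y_{0i} \mid D_i = 1] = E\left\{\frac{\Phi_b\left(\frac{X_i'\beta_0^* + \beta_1^*}{\sigma_\varepsilon}, \frac{X_i'\gamma_0^* + \gamma_1^* Z_i}{\sigma_v}; \rho_{\varepsilon v}\right) - \Phi_b\left(\frac{X_i'\beta_0^*}{\sigma_\varepsilon}, \frac{X_i'\gamma_0^* + \gamma_1^* Z_i}{\sigma_v}; \rho_{\varepsilon v}\right)}{\Phi\left(\frac{X_i'\gamma_0^* + \gamma_1^* Z_i}{\sigma_v}\right)} \,\Bigg|\, D_i = 1\right\}. \tag{4.6.16}$$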
The bivariate normal CDF is a canned function in many 
software packages, so this is easy enough to calculate in 
practice. 
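To see how little the index coefficient alone says about magnitudes, consider a direct evaluation of the average treatment effect in (4.6.15). This is only a sketch: the scaled coefficient `beta1` and the two sets of index values are hypothetical numbers chosen for illustration, and the normal CDF is built from `math.erf` rather than taken from a statistics package.

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit_ate(index_values, beta1):
    """Average of Phi(index + beta1) - Phi(index), as in (4.6.15).

    index_values holds X_i'beta_0*/sigma_eps for each person and beta1 is
    beta_1*/sigma_eps, so both are already in standard-deviation units.
    """
    effects = [norm_cdf(v + beta1) - norm_cdf(v) for v in index_values]
    return sum(effects) / len(effects)

beta1 = -0.5                      # hypothetical scaled index coefficient
centered = [0.0, 0.0, 0.0, 0.0]   # everyone starts at a 50 percent employment rate
extreme = [2.5, 2.5, 2.5, 2.5]    # everyone is very likely to work regardless

print(probit_ate(centered, beta1))   # roughly -0.19
print(probit_ate(extreme, beta1))    # roughly -0.017
```

The same index coefficient produces an effect more than ten times larger in the first population than in the second. The sign carries over from the coefficient, but the size depends on where the covariates put people on the normal CDF.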

Bivariate probit probably qualifies as harmless in the sense 
that it’s not very complicated and easy to get right using pack- 
aged software routines. Still, it shares the disadvantages of 
nonlinear latent index modeling discussed in section 3.4.2. 
First, some researchers become distracted by an effort to 
estimate index coefficients instead of average causal effects. 
For example, a large literature in econometrics is concerned 
with the estimation of index coefficients without the need for 
distributional assumptions. Applied researchers interested in 
causal effects can safely ignore this work.36 


36 Suppose the latent error term has an unknown distribution, with CDF
$\Lambda[\cdot]$. The average causal effect in this case is


$$E\{\Lambda[X_i'\beta_0^* + \beta_1^*] - \Lambda[X_i'\beta_0^*]\} = E\{\Lambda'[X_i'\beta_0^* + \tilde{\beta}_i]\beta_1^*\},$$


where (by the mean value theorem) $\tilde{\beta}_i$ is a number in $[0, \beta_1^*]$. This always
depends on the shape of $\Lambda[\cdot]$, so it is never enough to know the index
coefficients alone.

A second vice in this context is also a virtue. Bivariate probit
and other models of this sort can be used to estimate
unconditional average causal effects and/or effects on the 
treated. In contrast, 2SLS does not promise you average causal 
effects, only local average causal effects. But it should be clear 
from (4.6.15) that the assumed normality of the latent index 
error terms is essential for this. As always, the best you can 
do without a distributional assumption is LATE, the average 
causal effect for compliers. For bivariate probit, we can write 
LATE as 


$$E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}] = E\{1[X_i'\beta_0^* + \beta_1^* > \varepsilon_i] - 1[X_i'\beta_0^* > \varepsilon_i] \mid X_i'\gamma_0^* + \gamma_1^* > v_i > X_i'\gamma_0^*\},$$


which, like (4.6.16), can be evaluated using joint normality of
$v_i$ and $\varepsilon_i$. But you needn’t bother using normality to evaluate
$E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}]$, since LATE can be estimated by IV for
each $X_i$ and averaged using the histogram of the covariates.
Alternately, do 2SLS and settle for a variance-weighted aver- 
age of covariate-specific LATEs, as described by the saturate 
and weight theorem in section 4.5.3. 
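The covariate-by-covariate route can be sketched in a short simulation. Everything specific below is an assumption of ours: a single binary covariate, a randomly assigned binary instrument, latent-index coefficients picked by hand, and, importantly, the weighting scheme. We weight each cell's Wald estimate by its estimated share of compliers (cell frequency times cell first stage), which is one reading of "averaged using the histogram of the covariates" when the target is the overall effect on compliers.

```python
import random

def mean(v):
    return sum(v) / len(v)

random.seed(3)
n = 40000
rows = []   # each row: (x, z, d, y, is_complier, y1 - y0)
for _ in range(n):
    x = 1 if random.random() < 0.4 else 0          # one binary covariate
    z = 1 if random.random() < 0.5 else 0          # randomly assigned instrument
    v = random.gauss(0, 1)
    eps = 0.5 * v + random.gauss(0, 1)             # latent errors are correlated
    d0 = 1 if -0.3 + 0.4 * x > v else 0            # potential treatment when z = 0
    d1 = 1 if -0.3 + 0.4 * x + 0.8 > v else 0      # potential treatment when z = 1
    y0 = 1 if 0.5 - 0.6 * x > eps else 0           # potential outcome, untreated
    y1 = 1 if 0.5 - 0.6 * x - 0.7 > eps else 0     # potential outcome, treated
    d = d1 if z else d0
    y = y1 if d else y0
    rows.append((x, z, d, y, d1 > d0, y1 - y0))

def wald(cell):
    """Wald (IV) estimate of LATE within one covariate cell, plus its first stage."""
    y_z1 = mean([r[3] for r in cell if r[1] == 1])
    y_z0 = mean([r[3] for r in cell if r[1] == 0])
    d_z1 = mean([r[2] for r in cell if r[1] == 1])
    d_z0 = mean([r[2] for r in cell if r[1] == 0])
    first_stage = d_z1 - d_z0
    return (y_z1 - y_z0) / first_stage, first_stage

cells = {x: [r for r in rows if r[0] == x] for x in (0, 1)}
late_by_x, weight_by_x = {}, {}
for x, cell in cells.items():
    late_by_x[x], first_stage = wald(cell)
    weight_by_x[x] = first_stage * len(cell) / n   # estimated share of compliers in cell x

total_weight = sum(weight_by_x.values())
late_hat = sum(late_by_x[x] * weight_by_x[x] for x in cells) / total_weight

# Only in a simulation can we look at compliers and both potential outcomes directly.
late_true = mean([r[5] for r in rows if r[4]])
print("covariate-averaged IV:", late_hat, " complier average:", late_true)
```

No normality assumption is used in the estimation step, even though the simulated errors happen to be normal. The IV average tracks the complier effect because the instrument is as good as randomly assigned within each covariate cell.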

You might be wondering whether LATE is enough. Perhaps 
you would like to estimate the unconditional average treat- 
ment effect or the effect of treatment on the treated and are 
willing to make a few extra assumptions to do so. That’s all 
well and good, but in our experience you can’t get blood from 
a stone, even with heroic assumptions. Since local informa- 
tion is all that’s in the data, in practice, the average causal 
effects produced by bivariate probit are likely to be simi- 
lar to 2SLS estimates, provided the model for covariates is 
sufficiently flexible. This is illustrated in table 4.6.1, which 
reports 2SLS and bivariate probit estimates of the effects of 
a third child on female labor supply using the Angrist-Evans 
(1998) same-sex instruments and the same 1980 census sam- 
ple of married women with two or more children used in their 
paper. The dependent variable is a dummy for having worked 
the previous year; the endogenous variable is a dummy for
having a third child. The first-stage effect of a same-sex sibship
on the probability of a third birth is about 7 percentage
points.


TABLE 4.6.1
2SLS, Abadie, and bivariate probit estimates of the effects of a third
child on female labor supply

                   Abadie Estimates          Bivariate Probit
        2SLS       Linear     Probit      MFX        ATE        TOT
        (1)        (2)        (3)         (4)        (5)        (6)

A. No covariates
        -.138      -.138      -.137       -.138      -.139      -.139
        (.029)     (.030)     (.030)      (.029)     (.029)     (.029)

B. Some covariates (no age controls)
        -.132      -.132      -.131       -.135      -.135      -.135
        (.029)     (.029)     (.028)      (.028)     (.028)     (.028)

C. Some covariates plus age at first birth
        -.129      -.129      -.129       -.133      -.133      -.133
        (.028)     (.028)     (.028)      (.026)     (.026)     (.026)

D. Some covariates plus age at first birth and a dummy for age > 30
        -.124      -.125      -.125       -.131      -.131      -.131
        (.028)     (.029)     (.029)      (.025)     (.025)     (.025)

E. Some covariates plus age at first birth and age
        -.120      -.121      -.121       -.171      -.171      -.171
        (.028)     (.026)     (.026)      (.023)     (.023)     (.023)

Notes: Adapted from Angrist (2001). The table compares 2SLS estimates to
alternative estimates of the effect of childbearing on labor supply using nonlinear
models. All models use same-sex instruments. Standard errors (in parentheses) for the Abadie
estimates were bootstrapped using 100 replications of subsamples of size 20,000.
MFX denotes marginal effects; ATE is the unconditional average treatment effect;
TOT is the average effect of treatment on the treated.

Panel A of table 4.6.1 reports estimates from a model with 
no covariates. The 2SLS estimate of -.138 in column 1 is
numerically identical to the Abadie causal effect estimated
using a linear model in column 2, as it should be in this
case. Without covariates, the 2SLS slope coefficient provides
the best linear approximation to the complier causal response
function, as does Abadie’s kappa-weighting procedure. The
marginal effect changes little if, instead of a linear approximation,
we use nonlinear least squares with a probit CEF.
The marginal effect estimated by minimizing


$$E\left\{\kappa_i\left(Y_i - \Phi\left[\frac{X_i'\beta_0^* + \beta_1^* D_i}{\sigma_\varepsilon}\right]\right)^2\right\}$$


is -.137, reported in column 3. This is not surprising, since
the model without covariates imposes no functional form
assumptions.

Perhaps more surprising is the fact that marginal effects 
and the average treatment effects calculated using (4.6.15) 
and (4.6.16) are also the same as the 2SLS and Abadie 
estimates. These results are reported in columns 4-6. The 
marginal effect calculated using a derivative to approximate
the finite difference in (4.6.15) is -.138 (in column 4,
labeled MFX for marginal effects), while both average treatment
effects are -.139 in columns 5 and 6. Adding a few
covariates has little effect on the estimates, as can be seen in 
panel B. In this case, the covariates are all dummy variables, 
three for race (black, Hispanic, other), and two indicating 
first- and second-born boys (the excluded instrument is the 
interaction of these two). Panels C and D show that adding a 
linear term in age at first birth and a dummy for maternal age 
also leaves the estimates unchanged. 

The invariance to covariates seems desirable: because the 
same-sex instrument is essentially independent of covariates, 
control for covariates is unnecessary to eliminate bias and 
should primarily affect precision. Yet, as panel E shows, the 
marginal effects generated by bivariate probit are sensitive to 
the list of covariates. Swapping a dummy indicating mothers 
over 30 with a linear age term increases the bivariate probit
estimates markedly, to -.171, while leaving 2SLS and
the Abadie estimators unchanged. This probably reflects the
fact that the linear age term induces an extrapolation into
cells where there is little data. Although there is no harm in
reporting the bivariate probit effects in panel E, it’s hard to
see why the more robust 2SLS and Abadie estimators should
not be preferred.37


4.6.4 The Bias of 2SLS* 


It is a fortunate fact that the OLS estimator is not only con- 
sistent, it is also unbiased (as we briefly noted at the end 
of section 3.1.3). This means that in a sample of any size, 
the estimated OLS coefficient vector has a distribution that is 
centered on the population coefficient vector.38 The 2SLS esti- 
mator, in contrast, is consistent, but biased. This means that 
the 2SLS estimator only promises to be close to the causal effect 
of interest in large samples. In small samples, 2SLS estimates 
can differ systematically from the target parameter. 

For many years, applied researchers lived with the knowl- 
edge that 2SLS is biased without losing too much sleep. Neither 
of us heard much about the bias of 2SLS in our graduate 
econometrics classes. A series of papers in the early 1990s 
changed this, however. These papers show that 2SLS estimates
can be highly misleading in cases relevant for empirical
practice.39

37 Angrist (2001) makes the same point using twins instruments and reports
a similar pattern in a comparison of 2SLS, Abadie, and nonlinear structural
estimates of models for hours worked.

38 A more precise statement is that OLS is unbiased when either (1) the CEF
is linear or (2) the regressors are nonstochastic, that is, fixed in repeated samples.
In practice, these qualifications do not seem to matter much. As a rule,
the sampling distribution of $\hat{\beta} = [\sum_i X_iX_i']^{-1}\sum_i X_iY_i$ tends to be centered on
the population analog, $\beta = E[X_iX_i']^{-1}E[X_iY_i]$, in samples of any size, whether
or not the CEF is linear or the regressors are stochastic.

39 Key references are Nelson and Startz (1990a,b), Buse (1992), Bekker
(1994), and especially Bound, Jaeger, and Baker (1995).

The 2SLS estimator is most biased when the instruments are
“weak,” meaning the correlation with endogenous regressors
is low, and when there are many overidentifying restrictions.
When the instruments are both many and weak, the 2SLS
estimator is biased toward the probability limit of the corresponding
OLS estimate. In the worst-case scenario, when
the instruments are so weak that there is no first stage in the
population, the 2SLS sampling distribution is centered on the
probability limit of OLS. The theory behind this result is a little
technical, but the basic idea is easy to see. The source of the
bias in 2SLS estimates is the randomness in estimates of the
first-stage fitted values. In practice, the first-stage estimates
reflect some of the randomness in the endogenous variable,
since the first-stage coefficients come from a regression of the
endogenous variable on the instruments. If the population first
stage is zero, then all randomness in the first stage is due to the
endogenous variable. This randomness generates finite-sample
correlation between first-stage fitted values and second-stage
errors, since the endogenous variable is correlated with the
second-stage errors (or else you wouldn’t be instrumenting in
the first place).
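The worst case is easy to reproduce by Monte Carlo. The sketch below makes several assumptions of its own: the instruments are 20 hypothetical group dummies that are unrelated to the endogenous regressor, the true effect `beta` is 1, and the error covariance is set so that the OLS bias is 0.5. With dummy instruments, first-stage fitted values are just group means, so 2SLS can be computed without matrix algebra.

```python
import random

def mean(v):
    return sum(v) / len(v)

def tsls_with_group_dummies(x, y, group, n_groups):
    """2SLS of y on x (plus a constant) using group dummies as instruments.

    With dummy instruments the first-stage fitted values are group means of x,
    so 2SLS is the regression of y on those group means.
    """
    sums, counts = [0.0] * n_groups, [0] * n_groups
    for xi, g in zip(x, group):
        sums[g] += xi
        counts[g] += 1
    x_hat = [sums[g] / counts[g] for g in group]
    mx, my = mean(x_hat), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x_hat, y))
    den = sum((a - mx) ** 2 for a in x_hat)
    return num / den

random.seed(4)
beta = 1.0                                   # hypothetical causal effect
n_groups, n_per, reps = 20, 50, 200
group = [g for g in range(n_groups) for _ in range(n_per)]

estimates = []
for _ in range(reps):
    xi = [random.gauss(0, 1) for _ in group]             # first-stage error
    eta = [0.5 * e + random.gauss(0, 1) for e in xi]     # second-stage error, correlated with xi
    x = list(xi)                                         # population first stage is zero
    y = [beta * a + b for a, b in zip(x, eta)]
    estimates.append(tsls_with_group_dummies(x, y, group, n_groups))

ols_bias = 0.5    # cov(eta, xi) / var(xi) in this design
print("average 2SLS bias:", mean(estimates) - beta, " OLS bias:", ols_bias)
```

Across replications the 2SLS estimates center on the OLS probability limit rather than on `beta`, because the only variation in the fitted values is sampling noise in the endogenous variable itself.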

A more formal derivation of 2SLS bias goes like this. To 
streamline the discussion we use matrices and vectors and a 
simple constant-effects model (it’s difficult to discuss bias in a 
heterogeneous effects world, since the target parameter may 
change as the number of instruments changes). Suppose you 
are interested in estimating the effect of a single endogenous 
regressor, stored in a vector x, on a dependent variable, stored 
in the vector y, with no other covariates. The causal model of 
interest can then be written 


$$y = \beta x + \eta. \tag{4.6.17}$$


The $N \times Q$ matrix of instrumental variables is $Z$, with the
associated first-stage equation


$$x = Z\pi + \xi. \tag{4.6.18}$$

OLS estimates of (4.6.17) are biased because $\eta_i$ is correlated
with $\xi_i$. The instruments $Z_i$ are uncorrelated with $\xi_i$ by
construction and uncorrelated with $\eta_i$ by assumption.

The 2SLS estimator is


$$\hat{\beta}_{2SLS} = (x'P_Zx)^{-1}x'P_Zy = \beta + (x'P_Zx)^{-1}x'P_Z\eta,$$


where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix that produces
fitted values from a regression of $x$ on $Z$. Substituting for $x$ in
$x'P_Z\eta$, we get


$$\hat{\beta}_{2SLS} - \beta = (x'P_Zx)^{-1}(\pi'Z' + \xi')P_Z\eta = (x'P_Zx)^{-1}\pi'Z'\eta + (x'P_Zx)^{-1}\xi'P_Z\eta. \tag{4.6.19}$$


The bias in 2SLS comes from the nonzero expectation of terms 
on the right-hand side. 

The expectation of (4.6.19) is hard to evaluate because
the expectation operator does not pass through the inverse
$(x'P_Zx)^{-1}$, a nonlinear function. It’s possible to show, however,
that the expectation of the ratios on the right-hand
side of (4.6.19) can be closely approximated by the ratio of
expectations. In other words,


$$E[\hat{\beta}_{2SLS} - \beta] \approx (E[x'P_Zx])^{-1}E[\pi'Z'\eta] + (E[x'P_Zx])^{-1}E[\xi'P_Z\eta].$$


This approximation is much better than the usual asymptotic
approximation invoked in large-sample theory, so we think of
it as giving us a good measure of the finite-sample behavior of
the 2SLS estimator.40 Furthermore, because $E[\pi'Z'\xi] = 0$ and
$E[\pi'Z'\eta] = 0$, we have


$$E[\hat{\beta}_{2SLS} - \beta] \approx [E(\pi'Z'Z\pi) + E(\xi'P_Z\xi)]^{-1}E(\xi'P_Z\eta). \tag{4.6.20}$$


The approximate bias of 2SLS therefore comes from the fact
that $E(\xi'P_Z\eta)$ is not zero unless $\eta_i$ and $\xi_i$ are uncorrelated. But
correlation between $\eta_i$ and $\xi_i$ is what led us to use IV in the
first place.

40 See Bekker (1994) and Angrist and Krueger (1995). This is also called a
group-asymptotic approximation because it can be derived from an asymptotic
sequence that lets the number of instruments go to infinity at the same
time as the number of observations goes to infinity, keeping the number of
observations per instrument (group) constant.

Further manipulation of (4.6.20) generates an expression
that is especially useful:


$$E[\hat{\beta}_{2SLS} - \beta] \approx \frac{\sigma_{\eta\xi}}{\sigma_\xi^2}\left[\frac{E(\pi'Z'Z\pi)/Q}{\sigma_\xi^2} + 1\right]^{-1}$$


(see the appendix for a derivation). The term $(1/\sigma_\xi^2)E(\pi'Z'Z\pi)/Q$
is the F-statistic for the joint significance of all regressors in
the first-stage regression.41 Call this statistic $F$, so that we can
write


$$E[\hat{\beta}_{2SLS} - \beta] \approx \frac{\sigma_{\eta\xi}}{\sigma_\xi^2}\,\frac{1}{F + 1}. \tag{4.6.21}$$


From this we see that as the first stage F-statistic gets small,
the bias of 2SLS approaches $\sigma_{\eta\xi}/\sigma_\xi^2$. The bias of the OLS estimator
is $\sigma_{\eta\xi}/\sigma_x^2$, which also equals $\sigma_{\eta\xi}/\sigma_\xi^2$ if $\pi = 0$. Thus, we
have shown that 2SLS is centered on the same point as OLS
when the first stage is zero. More generally, we can say 2SLS
estimates are “biased toward OLS estimates” when there isn’t
much of a first stage. On the other hand, the bias of 2SLS
vanishes when $F$ gets large, as should happen in large samples
when $\pi \neq 0$.
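As a rough numerical illustration of formula (4.6.21), the following sketch evaluates the approximate 2SLS bias for a few values of the population first-stage F. The function name and the input values are ours and purely illustrative; only the formula comes from the text.

```python
def approx_2sls_bias(sigma_eta_xi, sigma_xi_sq, first_stage_f):
    """Approximate 2SLS bias from (4.6.21): (sigma_eta_xi / sigma_xi^2) / (F + 1)."""
    return (sigma_eta_xi / sigma_xi_sq) / (first_stage_f + 1.0)


if __name__ == "__main__":
    # Hypothetical values: error covariance 0.8, first-stage error variance 1.
    for f in (0.0, 1.0, 9.0, 99.0):
        print(f"F = {f:5.1f}  approximate bias = {approx_2sls_bias(0.8, 1.0, f):.4f}")
```

With F = 0 the function returns the full limiting bias, and with F = 9 it returns one-tenth of that value, which is one way to read the common preference for first-stage F-statistics around 10 or higher.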

When the instruments are weak, the F-statistic varies
inversely with the number of instruments. To see why, consider
adding useless instruments to your 2SLS model, that is,
instruments with no effect on the first-stage $R^2$. The model sum
of squares, $E(\pi'Z'Z\pi)$, and the residual variance, $\sigma_\xi^2$, will both
stay the same while $Q$ goes up. The F-statistic becomes smaller
as a result. From this we learn that the addition of many weak
instruments increases bias.
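The arithmetic behind this argument can be sketched in a few lines. The code below holds the model sum of squares and the residual variance fixed and increases the number of instruments; the function names and the numbers (a model sum of squares of 10, unit residual variance) are hypothetical choices made for illustration.

```python
def population_f(model_ss, sigma_xi_sq, q):
    """Population first-stage F: (1 / sigma_xi^2) * E(pi'Z'Z pi) / Q."""
    return model_ss / (sigma_xi_sq * q)


def bias_share(model_ss, sigma_xi_sq, q):
    """Approximate 2SLS bias as a share of its F = 0 limit: 1 / (F + 1)."""
    return 1.0 / (population_f(model_ss, sigma_xi_sq, q) + 1.0)


if __name__ == "__main__":
    # Adding worthless instruments raises Q but leaves model_ss unchanged.
    for q in (1, 2, 20):
        f = population_f(10.0, 1.0, q)
        print(f"Q = {q:2d}  F = {f:5.2f}  bias share = {bias_share(10.0, 1.0, q):.3f}")
```

Going from one instrument to twenty cuts F from 10 to 0.5 in this example, so the approximate bias rises from one-eleventh to two-thirds of its limiting value.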

Intuitively, the bias in 2SLS is a consequence of the fact that
the first stage is estimated. If the first-stage coefficients were
known, we could use $\hat{x}_{pop} = Z\pi$ for the first-stage fitted values.
These fitted values are uncorrelated with the second-stage
error. In practice, however, we use $\hat{x} = P_Z x = Z\pi + P_Z\xi$,
which differs from $\hat{x}_{pop}$ by the term $P_Z\xi$. The bias in 2SLS
arises from the fact that $P_Z\xi$ is correlated with $\eta$, so some of
the correlation between errors in the first and second stages
seeps into our 2SLS estimates through the sampling variability
in $\hat{\pi}$. Asymptotically, this correlation disappears, but real life
does not play out in asymptopia.

41 Sort of; the actual F-statistic is $(1/\hat{\sigma}_\xi^2)\hat{\pi}'Z'Z\hat{\pi}/Q$, where hats denote
estimates. $(1/\sigma_\xi^2)E(\pi'Z'Z\pi)/Q$ is therefore sometimes called the population
F-statistic since it’s the F-statistic we’d get in an infinitely large sample. In
practice, the distinction between population and sample F matters little in
this context. Some econometricians prefer to multiply the first-stage F by the
number of instruments when summarizing instrument strength. This product
is called the “concentration parameter.”

Formula (4.6.21) shows that, other things equal, the bias in 
2SLS is an increasing function of the number of instruments, 
so bias is least in the just-identified case when the number of 
instruments is as low as it can get. In fact, just-identified 2SLS 
(say, the simple Wald estimator) is approximately unbiased. 
This is hard to show formally because just-identified 2SLS has 
no moments (i.e., the sampling distribution has fat tails). Nev- 
ertheless, even with weak instruments, just-identified 2SLS is 
approximately centered where it should be. We therefore say 
that just-identified 2SLS is median-unbiased. This is not to say 
that you can happily use weak instruments in just-identified 
models. With a weak instrument, just-identified estimates tend 
to be too imprecise to be useful. 

The limited information maximum likelihood (LIML) esti- 
mator is approximately median-unbiased for overidentified 
constant effects models, and therefore provides an attractive 
alternative to just-identified estimation using one instrument 
at a time (see, e.g., Davidson and MacKinnon, 1993, and 
Mariano, 2001). LIML has the advantage of having the same 
asymptotic distribution as 2SLS (under constant effects) while 
providing a finite-sample bias reduction. A number of other 
estimators also reduce the bias in overidentified 2SLS mod- 
els. But an extensive Monte Carlo study by Flores-Lagunes 
(2007) suggests that LIML does at least as well as the alter- 
natives in a wide range of circumstances (in terms of bias, 
mean absolute error, and the empirical rejection rates for t- 
tests). Another advantage of LIML is that many statistical 
packages compute it, while other estimators typically require 
some programming.42


42 LIML is available in SAS and in Stata 10. With weak instruments,
LIML standard errors are not quite right, but Bekker (1994) gives a simple fix
for this. Why is LIML unbiased? Expression (4.6.21) shows that the approximate
bias of 2SLS is proportional to the bias of OLS. From this we conclude
that there is a linear combination of OLS and 2SLS that is approximately
unbiased. LIML turns out to be just such a “combination estimator.” Like
the bias of 2SLS, the approximate unbiasedness of LIML can be shown using
a Bekker-style group-asymptotic sequence that fixes the ratio of instruments
to sample size. It is worth mentioning, however, that LIML is biased in models
with a certain type of heteroskedasticity. See Bekker and van der Ploeg
(2005) and Hausman et al. (2008) for details. Unlike LIML, the Jackknife
IV Estimator (JIVE; see, e.g., Angrist, Imbens, and Krueger, 1999) is
Bekker-unbiased under heteroskedasticity. Ackerberg and Devereux (2007) recently
introduced an improved version of JIVE with lower variance.

We use a small Monte Carlo experiment to illustrate some
of the theoretical results from the discussion above. The
simulated data are drawn from the following model,

$$y_i = \beta x_i + \eta_i$$
$$x_i = \sum_{j=1}^{Q}\pi_j z_{ij} + \xi_i,$$

with $\beta = 1$, $\pi_1 = 0.1$, $\pi_j = 0$ for $j = 2, \ldots, Q$; and

$$\begin{pmatrix}\eta_i \\ \xi_i\end{pmatrix}\Big|\, Z \sim N\left(\begin{pmatrix}0 \\ 0\end{pmatrix}, \begin{pmatrix}1 & 0.8 \\ 0.8 & 1\end{pmatrix}\right),$$

where the $z_{ij}$ are independent, normally distributed random
variables with mean zero and unit variance. This simulates
a scenario with one good instrument and Q - 1 worthless
instruments. The sample size is 1000.

Figure 4.6.1 shows the Monte Carlo cumulative distribution
functions of four estimators: OLS, just-identified IV (i.e.,
2SLS with Q = 1, labeled IV; first-stage F = 11.1), 2SLS with
two instruments (Q = 2, labeled 2SLS; first-stage F = 6.0),
and LIML with Q = 2. The OLS estimator is biased and
centered around a value of about 1.79. IV is centered around
1, the value of $\beta$. 2SLS with one weak and one uninformative
instrument is moderately biased toward OLS (the
median is 1.07). The distribution function for LIML with
Q = 2 is indistinguishable from that for just-identified IV,
even though the LIML estimator also uses an uninformative
instrument.

Figure 4.6.1 Monte Carlo cumulative distribution functions of
OLS, IV (Q = 1), 2SLS (Q = 2), and LIML (Q = 2) estimators.

Figure 4.6.2 reports simulation results where we set Q = 20.
Thus, in addition to the one informative but weak instrument,
we added 19 worthless instruments (first-stage F = 1.51). The
figure again shows OLS, 2SLS, and LIML distributions. The
bias in 2SLS is now much worse (the median is 1.53, close
to the OLS median). The sampling distribution of the 2SLS
estimator is also much tighter than in the Q = 2 case. LIML
again performs well and is centered around $\beta = 1$, with a bit
more dispersion than in the Q = 2 case.

Figure 4.6.2 Monte Carlo cumulative distribution functions of
OLS, 2SLS, and LIML estimators with Q = 20 instruments.

Figure 4.6.3 Monte Carlo cumulative distribution functions
of OLS, 2SLS, and LIML estimators with Q = 20 worthless
instruments.

Finally, figure 4.6.3 reports simulation results from a model
that is truly unidentified. In this case, we set $\pi_j = 0$; $j = 1, \ldots, 20$
(first-stage F = 1.0). Not surprisingly, all the sampling
distributions are centered around the same value as OLS.
On the other hand, the 2SLS sampling distribution is much 
tighter than the LIML distribution. We would say advantage 
LIML in this case because the widely dispersed LIML sam- 
pling distribution correctly reflects the fact that the data are 
uninformative about the parameter of interest. 
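A minimal sketch of a Monte Carlo experiment along these lines is given below. It is not the code behind the figures. We assume an error correlation of 0.8 (consistent with OLS being centered near 1.79), use far fewer replications than a published study would, and compute LIML through the standard k-class representation, which the text does not spell out. All function names are ours.

```python
import numpy as np


def simulate(n, q, rng, beta=1.0, pi1=0.1, rho=0.8):
    """Draw one sample: one weak instrument plus q - 1 worthless ones."""
    z = rng.standard_normal((n, q))
    cov = [[1.0, rho], [rho, 1.0]]  # assumed covariance of (eta, xi)
    eta, xi = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    x = pi1 * z[:, 0] + xi
    y = beta * x + eta
    return y, x, z


def project(z, w):
    """Fitted values P_Z w from a least-squares regression of w on z."""
    coef, *_ = np.linalg.lstsq(z, w, rcond=None)
    return z @ coef


def k_class(y, x, z, kappa):
    """k-class estimator: kappa = 0 gives OLS, kappa = 1 gives 2SLS."""
    x_k = x - kappa * (x - project(z, x))  # (I - kappa * M_Z) x
    return (x_k @ y) / (x_k @ x)


def liml_kappa(y, x, z):
    """Smallest eigenvalue of (W'M_Z W)^{-1} W'W, where W = [y, x]."""
    w = np.column_stack([y, x])
    resid = w - project(z, w)  # M_Z W
    eigvals = np.linalg.eigvals(np.linalg.solve(resid.T @ resid, w.T @ w))
    return float(np.min(eigvals.real))


def median_estimates(q, reps=300, n=1000, seed=0):
    """Monte Carlo medians of OLS, 2SLS, and LIML with q instruments."""
    rng = np.random.default_rng(seed)
    draws = {"ols": [], "2sls": [], "liml": []}
    for _ in range(reps):
        y, x, z = simulate(n, q, rng)
        draws["ols"].append(k_class(y, x, z, 0.0))
        draws["2sls"].append(k_class(y, x, z, 1.0))
        draws["liml"].append(k_class(y, x, z, liml_kappa(y, x, z)))
    return {name: float(np.median(vals)) for name, vals in draws.items()}


if __name__ == "__main__":
    for q in (1, 2, 20):
        print(q, median_estimates(q))
```

Medians are reported rather than means because just-identified IV and LIML have no moments, so a Monte Carlo average can be dominated by a handful of extreme draws.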

What does this mean in practice? Besides retaining a vague 
sense of worry about your first stage, we recommend the 
following: 


1. Report the first stage and think about whether it makes 
sense. Are the magnitude and sign as you would expect, 
or are the estimates too big or wrong-signed? If so, perhaps 
your hypothesized first-stage mechanism isn’t really there;
rather, you simply got lucky.

2. Report the F-statistic on the excluded instruments. The bigger
this is, the better. Stock, Wright, and Yogo (2002)
suggest that F-statistics above about 10 put you in the safe 
zone, though obviously this cannot be a theorem. 

3. Pick your single best instrument and report just-identified 
estimates using this one only. Just-identified IV is median- 
unbiased and therefore unlikely to be subject to a weak 
instruments critique. 

4. Check overidentified 2SLS estimates with LIML. LIML is 
less precise than 2SLS but also less biased. If the results come 
out similar, be happy. If not, worry, and try to find stronger 
instruments or reduce the degree of overidentification. 

5. Look at the coefficients, t-statistics, and F-statistics for 
excluded instruments in the reduced-form regression of 
dependent variables on instruments. Remember that the 
reduced form is proportional to the causal effect of interest. 
Moreover, the reduced-form estimates, since they are OLS, 
are unbiased. As Angrist and Krueger (2001) note, if you 
can’t see the causal relation of interest in the reduced form, 
it’s probably not there.43
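Recommendations 2 and 5 can be sketched in code as follows. The sketch computes a classical (homoskedastic) F-statistic for the excluded instruments and, for a single instrument, the ratio of the reduced form to the first stage. The function names and the argument layout (`w` holds the controls, including a constant) are our own assumptions, not an existing library interface.

```python
import numpy as np


def residualize(v, w):
    """Residuals from a least-squares regression of v on the columns of w."""
    coef, *_ = np.linalg.lstsq(w, v, rcond=None)
    return v - w @ coef


def excluded_instruments_f(x, z, w):
    """Classical F for the instruments z in a first stage of x on [w, z]."""
    n, q = z.shape
    k = w.shape[1] + q
    rss_restricted = np.sum(residualize(x, w) ** 2)
    rss_full = np.sum(residualize(x, np.column_stack([w, z])) ** 2)
    return ((rss_restricted - rss_full) / q) / (rss_full / (n - k))


def slope(v, z, w):
    """Coefficient on a single variable z in a regression of v on [w, z]."""
    z_res = residualize(z, w)
    return (z_res @ v) / (z_res @ z_res)


def wald_iv(y, x, z, w):
    """Just-identified IV: reduced-form slope divided by first-stage slope."""
    return slope(y, z, w) / slope(x, z, w)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 500
    w = np.ones((n, 1))                      # constant only
    z = rng.standard_normal(n)               # one instrument
    u = rng.standard_normal(n)               # confounder
    x = 0.5 * z + u
    y = 2.0 * x + u + rng.standard_normal(n)
    print("first-stage F:", excluded_instruments_f(x, z[:, None], w))
    print("reduced form:", slope(y, z, w), "IV:", wald_iv(y, x, z, w))
```

Because the IV estimate is the reduced-form coefficient rescaled by the first stage, a reduced form that is indistinguishable from zero leaves little for IV to find.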


We illustrate some of this reasoning in a reanalysis of 
data from the Angrist and Krueger (1991) quarter-of-birth 
study. Bound, Jaeger, and Baker (1995) argued that bias is 
a major concern when using quarter of birth as an instrument 
for schooling, even though the sample size exceeds 300,000. 
(“Small sample” is clearly relative.) Earlier in the chapter, we 
saw that the quarter-of-birth pattern in schooling is reflected 
in the reduced form, so there would seem to be little cause for 
concern. On the other hand, Bound, Jaeger, and Baker (1995) 
argue that the most relevant models have additional controls 
not included in these reduced forms. Table 4.6.2 reproduces 
some of the specifications from Angrist and Krueger (1991) as 
well as other specifications in the spirit of Bound, Jaeger, and 
Baker (1995). 


43 A recent paper by Chernozhukov and Hansen (2008) formalizes this
maxim.




TABLE 4.6.2
Alternative IV estimates of the economic returns to schooling

                                       (1)     (2)     (3)     (4)     (5)     (6)
2SLS                                  .105    .435    .089    .076    .093    .091
                                    (.020)  (.450)  (.016)  (.029)  (.009)  (.011)
LIML                                  .106    .539    .093    .081    .106    .110
                                    (.020)  (.627)  (.018)  (.041)  (.012)  (.015)
F-statistic (excluded instruments)   32.27     .42    4.91    1.61    2.58    1.97
Controls
  Year of birth                          ✓       ✓       ✓       ✓       ✓       ✓
  State of birth                                                         ✓       ✓
  Age, age squared                               ✓               ✓               ✓
Excluded instruments
  Quarter-of-birth dummies               ✓       ✓
  Quarter of birth*year of birth                         ✓       ✓       ✓       ✓
  Quarter of birth*state of birth                                        ✓       ✓
Number of excluded instruments           3       2      30      28     180     178

Notes: The table compares 2SLS and LIML estimates using alternative sets of instruments
and controls. The age and age squared variables measure age in quarters. The OLS
estimate corresponding to the models reported in columns 1-4 is .071; the OLS estimate
corresponding to the models reported in columns 5 and 6 is .067. Data are from the Angrist
and Krueger (1991) 1980 census sample. The sample size is 329,509. Standard errors are
reported in parentheses.


The first column in the table reports 2SLS and LIML esti- 
mates of a model using three quarter-of-birth dummies as 
instruments, with year-of-birth dummies as covariates. The 
OLS estimate for this specification is 0.071, while the 2SLS 
estimate is a bit higher, at 0.105. The first-stage F-statistic is 
over 32, well out of the danger zone. Not surprisingly, the 
LIML estimate is almost identical to 2SLS in this case. 

Angrist and Krueger (1991) experimented with models that 
include age and age squared measured in quarters as additional 
controls. These controls are meant to pick up omitted age 
effects that might confound the quarter-of-birth instruments. 
The addition of age and age squared reduces the number of 
instruments to two, since age in quarters, year of birth, and 
quarter of birth are linearly dependent. As shown in column 2, 
the first-stage F-statistic drops to 0.4 when age and age squared 
are included as controls, a sure sign of trouble. But the 2SLS
standard error is high enough that we would not draw any
substantive conclusions from this estimate. The LIML estimate 
is even less precise. This model is effectively unidentified. 

Columns 3 and 4 report the results of adding interactions 
between quarter-of-birth dummies and year-of-birth dummies 
to the instrument list, so that there are 30 instruments, or 
28 when the age and age squared variables are included. The 
first-stage F-statistics are 4.9 and 1.6 in these two specifica- 
tions. The 2SLS estimates are a bit lower than in column 1 
and hence closer to OLS. But LIML is not too far away from 
2SLS. Although the LIML standard error is pretty big in col- 
umn 4, it is not so large that the estimate is uninformative. 
On balance, there seems to be little cause for worry about 
weak instruments in 30-instrument models, even with the age 
quadratic included. 

The most worrisome specifications are those reported in 
columns 5 and 6. These estimates were constructed by adding 
150 interactions between quarter of birth and state of birth to 
the 30 interactions between quarter of birth and year of birth. 
The rationale for the inclusion of state-of-birth interactions 
in the instrument list is to exploit differences in compulsory 
schooling laws across states. But this leads to highly overiden- 
tified models with 180 (or 178) instruments, many of which 
are weak. The first stage F-statistics for these models are only 
2.6 and 2.0. On the plus side, the LIML estimates again look 
fairly similar to 2SLS. Moreover, the LIML standard errors are 
not too far above the 2SLS standard errors in this case. This 
suggests that you can’t always determine instrument relevance 
using a mechanical rule, such as “F > 10.” In some cases, a 
low F may not be fatal.44

44 Cruz and Moreira (2005) similarly conclude that, low F-statistics
notwithstanding, there is little bias in the Angrist and Krueger (1991)
180-instrument specifications.

Finally, it’s worth noting that in applications with multiple
endogenous variables, the conventional first-stage F is
no longer appropriate. To see why, suppose there are two
instruments for two endogenous variables and that the first
instrument is strong and predicts both endogenous variables
well, while the second instrument is weak. The first-stage
F-statistics in each of the two first-stage equations are likely
to be high, but the model is weakly identified, because one
instrument is not enough to capture two causal effects. A simple
modification of the first-stage F for this case is given in the
appendix.


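4.7 Appendix


Derivation of Equation (4.6.8)


Rewrite equation (4.6.7) as follows:

$$Y_{ij} = \mu + \pi_0\tau_i + (\pi_0 + \pi_1)\bar{S}_j + \nu_i,$$

where $\tau_i \equiv s_i - \bar{S}_j$. Since $\tau_i$ and $\bar{S}_j$ are uncorrelated by
construction, we have:

$$\rho_1 = \pi_0 + \pi_1,$$
$$\pi_0 = \frac{Cov(\tau_i, Y_{ij})}{V(\tau_i)}.$$

Expanding the second line,

$$\begin{aligned}
\pi_0 &= \frac{Cov[(s_i - \bar{S}_j), Y_{ij}]}{V(s_i) - V(\bar{S}_j)} \\
&= \frac{Cov(s_i, Y_{ij})}{V(s_i)}\left[\frac{V(s_i)}{V(s_i) - V(\bar{S}_j)}\right] - \frac{Cov(\bar{S}_j, Y_{ij})}{V(\bar{S}_j)}\left[\frac{V(\bar{S}_j)}{V(s_i) - V(\bar{S}_j)}\right] \\
&= \rho_0\phi + \rho_1(1 - \phi) = \rho_1 + \phi(\rho_0 - \rho_1),
\end{aligned}$$

where $\phi \equiv \frac{V(s_i)}{V(s_i) - V(\bar{S}_j)}$ is a positive number. Solving for $\pi_1$, we
also have

$$\pi_1 = \rho_1 - \pi_0 = \phi(\rho_1 - \rho_0).$$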
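The algebra above can be checked numerically. The sketch below simulates grouped data, takes $\rho_0$ and $\rho_1$ to be the bivariate regression slopes of the outcome on $s_i$ and on the group average $\bar{S}_j$ (which is how the derivation uses them), and compares both sides of the two identities. The data-generating values and the function names are arbitrary illustrative choices.

```python
import numpy as np


def bivariate_slope(y, x):
    """Bivariate OLS slope Cov(x, y) / V(x)."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)


def decomposition_terms(seed=0, groups=40, per_group=25):
    """Return (pi0, pi1, rho0, rho1, phi) computed on simulated grouped data."""
    rng = np.random.default_rng(seed)
    g = np.repeat(np.arange(groups), per_group)
    s = rng.normal(12.0, 2.0, size=g.size) + rng.normal(0.0, 1.0, size=groups)[g]
    s_bar = (np.bincount(g, weights=s) / per_group)[g]  # group averages
    y = 0.1 * s + 0.05 * s_bar + rng.normal(size=g.size)
    # Long regression of y on a constant, s, and the group average.
    design = np.column_stack([np.ones_like(s), s, s_bar])
    _, pi0, pi1 = np.linalg.lstsq(design, y, rcond=None)[0]
    rho0 = bivariate_slope(y, s)
    rho1 = bivariate_slope(y, s_bar)
    phi = s.var() / (s.var() - s_bar.var())
    return pi0, pi1, rho0, rho1, phi


if __name__ == "__main__":
    pi0, pi1, rho0, rho1, phi = decomposition_terms()
    print("pi0:", pi0, "vs", rho1 + phi * (rho0 - rho1))
    print("pi1:", pi1, "vs", phi * (rho1 - rho0))
```

The identities hold exactly in the sample, not just in expectation, because deviations from group averages are orthogonal to the group averages by construction.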
Derivation of the Approximate Bias of 2SLS
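Start with (4.6.20):

$$E[\hat{\beta}_{2SLS} - \beta] \approx [E(\pi'Z'Z\pi) + E(\xi'P_Z\xi)]^{-1}E(\xi'P_Z\eta).$$

The magic of linear algebra helps us simplify this expression:
the term $\xi'P_Z\eta$ is a scalar and therefore equal to its trace; the
trace function is a linear operator that passes through expectations
and is invariant to cyclic permutations; finally, the trace
of $P_Z$, an idempotent matrix, is equal to its rank, $Q$. Using
these facts, and iterating expectations over $Z$, we have

$$\begin{aligned}
E(\xi'P_Z\eta \mid Z) &= E[tr(\xi'P_Z\eta) \mid Z] \\
&= E[tr(P_Z\eta\xi') \mid Z] \\
&= tr(P_Z E[\eta\xi' \mid Z]) \\
&= tr(P_Z\sigma_{\eta\xi}I) \\
&= \sigma_{\eta\xi}\,tr(P_Z) \\
&= \sigma_{\eta\xi}Q,
\end{aligned}$$

where we have assumed that $\eta_i$ and $\xi_i$ are homoskedastic. Similarly,
applying the trace trick to $E[\xi'P_Z\xi]$ shows that this term
is equal to $\sigma_\xi^2 Q$. Therefore,

$$E[\hat{\beta}_{2SLS} - \beta] \approx \sigma_{\eta\xi}Q[E(\pi'Z'Z\pi) + \sigma_\xi^2 Q]^{-1}.$$

Dividing the numerator and denominator by $\sigma_\xi^2 Q$ gives the form used in the text:

$$E[\hat{\beta}_{2SLS} - \beta] \approx \frac{\sigma_{\eta\xi}}{\sigma_\xi^2}\left[\frac{E(\pi'Z'Z\pi)/Q}{\sigma_\xi^2} + 1\right]^{-1}.$$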
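The trace argument is easy to verify numerically. The sketch below builds a projection matrix from simulated instruments, confirms that it is idempotent with trace equal to the number of instruments, and checks by simulation that the cross term averages to roughly the error covariance times Q. The dimensions, the covariance of 0.8, and the function names are illustrative assumptions.

```python
import numpy as np


def projection_matrix(z):
    """P_Z = Z (Z'Z)^{-1} Z'."""
    return z @ np.linalg.solve(z.T @ z, z.T)


def average_cross_term(p, sigma_eta_xi, reps, rng):
    """Monte Carlo average of xi' P_Z eta, holding Z (and so P_Z) fixed."""
    n = p.shape[0]
    cov = [[1.0, sigma_eta_xi], [sigma_eta_xi, 1.0]]
    total = 0.0
    for _ in range(reps):
        eta, xi = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        total += xi @ p @ eta
    return total / reps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal((200, 5))  # n = 200 observations, Q = 5 instruments
    p = projection_matrix(z)
    print("trace of P_Z:", np.trace(p))
    print("idempotent:", np.allclose(p @ p, p))
    print("average xi'P_Z eta:", average_cross_term(p, 0.8, 2000, rng))
```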


Multivariate First-Stage F-Statistics


Assume any exogenous covariates have been partialed out of
the instrument list and there are two endogenous variables,
$x_1$ and $x_2$, with coefficients $\delta_1$ and $\delta_2$. We are interested in
the bias of the 2SLS estimator of $\delta_2$ when $x_1$ is also treated as
endogenous. The second-stage equation is

$$y = P_Z x_1\delta_1 + P_Z x_2\delta_2 + [\eta + (x_1 - P_Z x_1)\delta_1 + (x_2 - P_Z x_2)\delta_2], \tag{4.7.1}$$

where $P_Z x_1$ and $P_Z x_2$ are the first-stage fitted values from
regressions of $x_1$ and $x_2$ on $Z$. By the multivariate regression
anatomy formula, $\delta_2$ in (4.7.1) is the bivariate regression of
$y$ on the residual from a regression of $P_Z x_2$ on $P_Z x_1$. This
residual is

$$[I - P_Z x_1(x_1'P_Z x_1)^{-1}x_1'P_Z]P_Z x_2 = M_{1z}P_Z x_2,$$

where $M_{1z} = [I - P_Z x_1(x_1'P_Z x_1)^{-1}x_1'P_Z]$ is the relevant
residual-maker matrix. Note also that $M_{1z}P_Z x_2 = P_Z[M_{1z}x_2]$.

From here we conclude that the 2SLS estimator of $\delta_2$ is the
OLS regression on $P_Z[M_{1z}x_2]$, in other words, OLS on the
fitted values from a regression of $M_{1z}x_2$ on $Z$. This is the same
as 2SLS using $Z$ to instrument $M_{1z}x_2$. So the 2SLS estimator
of $\delta_2$ can be written

$$[x_2'M_{1z}P_Z M_{1z}x_2]^{-1}x_2'M_{1z}P_Z y = \delta_2 + [x_2'M_{1z}P_Z M_{1z}x_2]^{-1}x_2'M_{1z}P_Z\eta.$$

The first-stage sum of squares (numerator of the F-statistic)
that determines the bias of the 2SLS estimator of $\delta_2$ is therefore
the expectation of $[x_2'M_{1z}P_Z M_{1z}x_2]$, while 2SLS bias comes
from the fact that the expectation $E[\xi'M_{1z}P_Z\eta]$ is nonzero
when $\eta$ and $\xi$ are correlated.

Here’s how to compute this F-statistic in practice:
(1) Regress the regressor of interest, $x_2$, on the other
first-stage fitted values, $P_Z x_1$, and any exogenous covariates.
Save the residuals from this step; these are $M_{1z}x_2$ in the
notation above.
(2) Construct the F-statistic for excluded instruments in a
first-stage regression of the residuals from (1) on the excluded
instruments. Note that you should get the 2SLS coefficient of
interest in a 2SLS procedure where the residuals from (1) are
instrumented using Z, with no other covariates or endogenous
variables. Use this fact to check your calculations.
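The two steps can be sketched as follows, assuming covariates have already been partialed out and the data are mean zero. Following the derivation above, the sketch forms $M_{1z}x_2$ by residualizing $x_2$ itself on the first-stage fitted values of $x_1$; residualizing the fitted values $P_Z x_2$ would leave a variable that lies entirely in the column space of Z and is fit perfectly in step 2. The numerator degrees of freedom (Q - 1) are our choice, made because the fitted residual lies in a (Q - 1)-dimensional subspace. The function names are hypothetical.

```python
import numpy as np


def project(z, w):
    """Fitted values P_Z w from a least-squares regression of w on z."""
    coef, *_ = np.linalg.lstsq(z, w, rcond=None)
    return z @ coef


def conventional_f(x, z):
    """Classical F for the joint significance of all columns of z."""
    n, q = z.shape
    fitted = project(z, x)
    resid = x - fitted
    return (fitted @ fitted / q) / (resid @ resid / (n - q))


def partial_out(x2, x1, z):
    """Step 1: residual of x2 on the first-stage fitted values of x1."""
    x1_hat = project(z, x1)
    return x2 - x1_hat * (x1_hat @ x2) / (x1_hat @ x1_hat)


def multivariate_first_stage_f(x2, x1, z):
    """Step 2: F for the instruments in a regression of the step-1 residual on z."""
    n, q = z.shape
    resid = partial_out(x2, x1, z)
    fitted = project(z, resid)
    rss = (resid - fitted) @ (resid - fitted)
    return (fitted @ fitted / (q - 1)) / (rss / (n - q))


def tsls_two_endogenous(y, x1, x2, z):
    """2SLS coefficients (delta_1, delta_2) with x1 and x2 both endogenous."""
    x_hat = np.column_stack([project(z, x1), project(z, x2)])
    return np.linalg.lstsq(x_hat, y, rcond=None)[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 2000
    z = rng.standard_normal((n, 2))
    # The first instrument drives both regressors; the second is worthless.
    x1 = z[:, 0] + rng.standard_normal(n)
    x2 = z[:, 0] + rng.standard_normal(n)
    print("conventional F:", conventional_f(x1, z), conventional_f(x2, z))
    print("multivariate F for x2:", multivariate_first_stage_f(x2, x1, z))
```

In this example both conventional F-statistics are large while the multivariate version is small, which is the pattern described in the text: one strong instrument cannot identify two causal effects.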
Chapter 5


Parallel Worlds: Fixed Effects,
Differences-in-Differences,
and Panel Data


The first thing to realize about parallel universes... is that
they are not parallel.
Douglas Adams, Mostly Harmless


The key to causal inference in chapter 3 is control for
observed confounding factors. If important confounders
are unobserved, we might try to get at causal effects using
instrumental variables, as discussed in chapter 4. Good instruments
are hard to find, however, so we’d like to have other
tools to deal with unobserved confounders. This chapter considers
a variation on the control theme: strategies that use data
with a time or cohort dimension to control for unobserved but
fixed omitted variables. These strategies punt on comparisons
in levels while requiring the counterfactual trend behavior of
treatment and control groups to be the same. We also discuss
the idea of controlling for lagged dependent variables, another
strategy that exploits timing.

5.1 Individual Fixed Effects 


One of the oldest questions in labor economics is the con- 
nection between union membership and wages. Do workers 
whose wages are set by collective bargaining earn more 
because of this, or would they earn more anyway, perhaps
because they are more experienced or skilled? To set this question
up, let $Y_{it}$ equal the (log) earnings of worker $i$ at time $t$,
and let $D_{it}$ denote his union status. The observed $Y_{it}$ is either
$Y_{0it}$ or $Y_{1it}$, depending on union status. Suppose further that

$$E[Y_{0it} \mid A_i, X_{it}, t, D_{it}] = E[Y_{0it} \mid A_i, X_{it}, t],$$


where $X_{it}$ is a vector of observed time-varying covariates and
$A_i$ is a vector of unobserved but fixed confounders that we’ll
call ability.

In other words, union status is as good as randomly assigned
conditional on $A_i$ and observed covariates, such as age,
schooling, and region of residence.

The key to fixed effects estimation is the assumption that
the unobserved $A_i$ appears without a time subscript in a linear
model for $E(Y_{0it} \mid A_i, X_{it}, t)$:

$$E[Y_{0it} \mid A_i, X_{it}, t] = \alpha + \lambda_t + A_i'\gamma + X_{it}'\beta. \tag{5.1.1}$$

We also assume that the causal effect of union membership is
additive and constant:

$$E[Y_{1it} \mid A_i, X_{it}, t] = E[Y_{0it} \mid A_i, X_{it}, t] + \rho.$$

Together with (5.1.1), this implies

$$E[Y_{it} \mid A_i, X_{it}, t, D_{it}] = \alpha + \lambda_t + \rho D_{it} + A_i'\gamma + X_{it}'\beta, \tag{5.1.2}$$

where $\rho$ is the causal effect of interest. The set of assumptions
leading to (5.1.2) is more restrictive than those we used to
motivate regression in chapter 3; we need the linear, additive
functional form to make headway on the problem of unobserved
confounders using panel data with no instruments.


1 In some cases, we can allow heterogeneous treatment effects so that

$$E(Y_{1it} - Y_{0it} \mid A_i, X_{it}, t) = \rho_i.$$

See, for example, Wooldridge (2005), who discusses estimators for the average
of $\rho_i$.

Equation (5.1.2) implies

$$Y_{it} = \alpha_i + \lambda_t + \rho D_{it} + X_{it}'\beta + \varepsilon_{it}, \tag{5.1.3}$$

where $\varepsilon_{it} \equiv Y_{0it} - E[Y_{0it} \mid A_i, X_{it}, t]$ and

$$\alpha_i \equiv \alpha + A_i'\gamma.$$

This is a fixed effects model. Given panel data (repeated observations
on individuals), the causal effect of union status on
wages can be estimated by treating $\alpha_i$, the fixed effect, as a
parameter to be estimated. The year effect, $\lambda_t$, is also treated
as a parameter to be estimated. The unobserved individual
effects are coefficients on dummies for each individual, while
the year effects are coefficients on time dummies.2

2 An alternative to the fixed effects specification is random effects (see, e.g.,
Wooldridge, 2006). The random effects model assumes that $\alpha_i$ is uncorrelated
with the regressors. Because the omitted variable in a random effects model
is uncorrelated with included regressors, there is no bias from ignoring it; in
effect, it becomes part of the residual. The most important consequence of
random effects is that the residuals for a given person are correlated across
periods. Chapter 8 discusses the implications of this for OLS standard errors.
Random effects models can be estimated by GLS, which promises to be more
efficient if the assumptions of the random effects model are satisfied. However,
as in chapter 3, we prefer fixing OLS standard errors to GLS. GLS
requires stronger assumptions than OLS, and the resulting asymptotic efficiency
gain is likely to be modest, while finite-sample properties may be worse.

It might seem that there are a lot of parameters to be estimated
in the fixed effects model. For example, the Panel Study
of Income Dynamics, a widely used panel data set, includes
data on about 5,000 working-age men observed for about
20 years. So there are roughly 5,000 fixed effects. In practice,
however, this doesn’t matter. Treating the individual effects
as parameters to be estimated is algebraically the same as estimation
in deviations from means. In other words, first we
calculate the individual averages,

$$\bar{Y}_i = \alpha_i + \bar{\lambda} + \rho\bar{D}_i + \bar{X}_i'\beta + \bar{\varepsilon}_i.$$

Subtracting this from (5.1.3) gives

$$Y_{it} - \bar{Y}_i = \lambda_t - \bar{\lambda} + \rho(D_{it} - \bar{D}_i) + (X_{it} - \bar{X}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i), \tag{5.1.4}$$

so deviations from means kills the unobserved individual
effects.3
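The equivalence between estimating a dummy for every individual and working in deviations from means can be checked on simulated data. The sketch below is a toy balanced panel with no covariates other than treatment status and year effects; the data-generating process (treatment more likely for high-ability individuals, a true effect of 0.1) and all function names are illustrative assumptions.

```python
import numpy as np


def simulate_panel(n, t, rho=0.1, seed=0):
    """Balanced panel in which treatment is correlated with a fixed effect."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, 1))                      # unobserved, fixed
    d = (a + rng.standard_normal((n, t)) > 0).astype(float)
    lam = np.linspace(0.0, 0.3, t)                       # year effects
    y = a + lam + rho * d + 0.1 * rng.standard_normal((n, t))
    return y, d


def ols_first_coef(design, y):
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0])


def pooled_estimate(y, d):
    """Cross-section style OLS: treatment plus year dummies, no fixed effects."""
    n, t = y.shape
    year = np.tile(np.eye(t), (n, 1))
    return ols_first_coef(np.column_stack([d.reshape(-1), year]), y.reshape(-1))


def lsdv_estimate(y, d):
    """OLS with a dummy for every individual plus T - 1 year dummies."""
    n, t = y.shape
    person = np.kron(np.eye(n), np.ones((t, 1)))
    year = np.tile(np.eye(t)[:, 1:], (n, 1))
    design = np.column_stack([d.reshape(-1), person, year])
    return ols_first_coef(design, y.reshape(-1))


def within_estimate(y, d):
    """OLS on deviations from individual means (year dummies demeaned too)."""
    n, t = y.shape
    y_dm = y - y.mean(axis=1, keepdims=True)
    d_dm = d - d.mean(axis=1, keepdims=True)
    year_dm = np.tile(np.eye(t)[:, 1:] - 1.0 / t, (n, 1))
    design = np.column_stack([d_dm.reshape(-1), year_dm])
    return ols_first_coef(design, y_dm.reshape(-1))


if __name__ == "__main__":
    y, d = simulate_panel(200, 5)
    print("pooled OLS:", pooled_estimate(y, d))
    print("dummy-variable regression:", lsdv_estimate(y, d))
    print("deviations from means:", within_estimate(y, d))
```

The last two estimates agree up to rounding error, while pooled OLS picks up the correlation between treatment and the omitted fixed effect.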

An alternative to deviations from means is differencing. In
other words, we estimate,

$$\Delta Y_{it} = \Delta\lambda_t + \rho\Delta D_{it} + \Delta X_{it}'\beta + \Delta\varepsilon_{it}, \tag{5.1.5}$$

where the $\Delta$ prefix denotes the change from one year to
the next. For example, $\Delta Y_{it} = Y_{it} - Y_{it-1}$. With two periods,
differencing is algebraically the same as deviations from
means, but not otherwise. Both should work, although with
homoskedastic and serially uncorrelated $\varepsilon_{it}$ and more than two
periods, deviations from means is more efficient. You might
find differencing more convenient if you have to do it by hand,
though the differenced standard errors should be adjusted for
the fact that the differenced residuals are serially correlated.
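The claim that differencing and deviations from means coincide with two periods, but not otherwise, can be illustrated with a short simulation. The sketch assumes a balanced panel with treatment status as the only regressor besides year effects; the simulated design and the function names are ours.

```python
import numpy as np


def simulate_panel(n, t, rho=0.1, seed=0):
    """Balanced panel in which treatment is correlated with a fixed effect."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, 1))
    d = (a + rng.standard_normal((n, t)) > 0).astype(float)
    lam = np.linspace(0.0, 0.3, t)  # year effects
    y = a + lam + rho * d + 0.1 * rng.standard_normal((n, t))
    return y, d


def two_way_demean(v):
    """Remove individual and year means (valid for a balanced panel)."""
    return v - v.mean(axis=1, keepdims=True) - v.mean(axis=0, keepdims=True) + v.mean()


def within_estimate(y, d):
    """Deviations-from-means estimator with year effects."""
    y_t, d_t = two_way_demean(y), two_way_demean(d)
    return float(np.sum(d_t * y_t) / np.sum(d_t ** 2))


def first_difference_estimate(y, d):
    """Pooled regression of changes on changes, with a dummy for each period."""
    dy, dd = np.diff(y, axis=1), np.diff(d, axis=1)
    dy = dy - dy.mean(axis=0, keepdims=True)  # absorbs the change in year effects
    dd = dd - dd.mean(axis=0, keepdims=True)
    return float(np.sum(dd * dy) / np.sum(dd ** 2))


if __name__ == "__main__":
    for t in (2, 3):
        y, d = simulate_panel(500, t)
        print(t, within_estimate(y, d), first_difference_estimate(y, d))
```

With two periods the two numbers printed are identical up to rounding. With three periods they generally differ, although both should be close to the true effect.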

Some regression packages automate the deviations from 
means estimator, with an appropriate standard error adjust- 
ment for the degrees of freedom lost in estimating N individ- 
ual means. This is all that’s needed to get the standard errors 
right with a homoskedastic, serially uncorrelated residual. The 
deviations from means estimator has many names, including 
the “within estimator” and “analysis of covariance.” Estima- 
tion in deviations from means form is also called absorbing 
the fixed effects.4

3 Why is deviations from means the same as estimating each fixed effect
in (5.1.3)? Because, by the regression anatomy formula, (3.1.3), any set of
multivariate regression coefficients can be estimated in two steps. To get the
multivariate coefficient on one set of variables, first regress them on all
the other included variables, then regress the original dependent variable on
the residuals from this first step. The residuals from a regression on a full set
of person-dummies in a person-year panel are deviations from person means.

4 The fixed effects are not estimated consistently in a panel where the number
of periods T is fixed while N → ∞. This is called the incidental parameters
problem, a name that reflects the fact that the number of parameters
grows with the sample size. Nevertheless, other parameters in the fixed effects
model (the ones we care about) are consistently estimated.

Freeman (1984) uses four data sets to estimate union wage
effects under the assumption that selection into union status
is based on unobserved but fixed individual characteristics.
Table 5.1.1 displays some of his estimates. For each data set,
the table displays results from a fixed effects estimator and
the corresponding cross section estimates. The cross section
estimates are typically higher (ranging from .14 to .28) than
the fixed effects estimates (ranging from .09 to .19). This may
indicate positive selection bias in the cross section estimates,
though selection bias is not the only explanation for the lower
fixed effects estimates.

TABLE 5.1.1
Estimated effects of union status on wages

                                  Cross Section    Fixed Effects
Survey                            Estimate         Estimate
May CPS, 1974-75                  .19              .09
National Longitudinal Survey
  of Young Men, 1970-78           .28              .19
Michigan PSID, 1970-79            .23              .14
QES, 1973-77                      .14              .16

Notes: Adapted from Freeman (1984). The table reports cross section
and panel (fixed effects) estimates of the union relative wage effect.
The estimates were calculated using the surveys listed in the left-hand
column. The cross section estimates include controls for demographic
and human capital variables.

Although they control for a certain type of omitted variable, 
fixed effects estimates are notoriously susceptible to attenua- 
tion bias from measurement error. On one hand, economic 
variables such as union status tend to be persistent (a worker 
who is a union member this year is most likely a union mem- 
ber next year). On the other hand, measurement error often 
changes from year to year (union status may be misreported or 
miscoded this year but not next year). Therefore, while union 
status may be misreported or miscoded for only a few workers 
in any single year, the observed year-to-year changes in union 
status may be mostly noise. In other words, there is more mea- 
surement error in the differenced regressors in an equation like 
(5.1.4) or (5.1.5) than in the levels of the regressors. This fact 
may account for smaller fixed effects estimates.5


5 See Griliches and Hausman (1986) for a more complete discussion of
measurement error in panel data.
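The attenuation argument can be illustrated with a stylized simulation. To isolate measurement error, the sketch below uses a continuous, highly persistent regressor with classical measurement error and no omitted fixed effect in the outcome; these simplifications, the variance choices, and the function names are all our own assumptions, not a description of the union data.

```python
import numpy as np


def bivariate_slope(y, x):
    """Bivariate OLS slope Cov(x, y) / V(x)."""
    xc = x - x.mean()
    return float((xc @ (y - y.mean())) / (xc @ xc))


def levels_and_differences(n=5000, beta=1.0, seed=0):
    """Slopes in levels and in first differences with a mismeasured regressor."""
    rng = np.random.default_rng(seed)
    persistent = rng.standard_normal((n, 1))                    # fixed over time
    x = persistent + np.sqrt(0.1) * rng.standard_normal((n, 2))  # true regressor
    x_obs = x + np.sqrt(0.1) * rng.standard_normal((n, 2))       # with noise
    y = beta * x + 0.5 * rng.standard_normal((n, 2))
    levels = bivariate_slope(y.reshape(-1), x_obs.reshape(-1))
    diffs = bivariate_slope(np.diff(y, axis=1).ravel(), np.diff(x_obs, axis=1).ravel())
    return levels, diffs


if __name__ == "__main__":
    levels, diffs = levels_and_differences()
    print("levels slope:", levels)
    print("first-difference slope:", diffs)
```

In this design the signal-to-total variance ratio is 1.1/1.2 in levels but only 0.2/0.4 in differences, so the differenced estimate is attenuated by about half even though the same measurement error barely matters in levels.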




A variant on the measurement error problem in panel data
arises from the fact that the differencing and deviations from
means estimators used to control for fixed effects typically 
remove both good and bad variation. In other words, these 
transformations may kill some of the omitted variables bias 
bathwater, but they also remove much of the useful informa- 
tion in the baby, the variable of interest. An example is the use 
of twins to estimate the causal effect of schooling on wages. 
Although there is no time dimension to this problem, the basic 
idea is the same as the union problem discussed above: twins 
have similar but largely unobserved family and genetic back- 
grounds. We can therefore control for their common family 
background by including a family fixed effect in samples of 
pairs of twins. 

Ashenfelter and Krueger (1994) and Ashenfelter and Rouse 
(1998) estimate the returns to schooling using samples of 
twins, controlling for family fixed effects. Because there are 
two twins from each family, this is the same as regressing 
differences in earnings within twin pairs on differences in 
schooling. Surprisingly, the within-family estimates come out 
larger than OLS estimates. But how do differences in school- 
ing come about between individuals who are otherwise so 
much alike? Bound and Solon (1999) point out that there 
are small differences between twins, with first-borns typically 
having higher birth weight and higher IQ scores (here differ- 
ences in birth timing are measured in minutes). While these 
within-twin differences are not large, neither is the difference 
in their schooling. Hence, small unobserved ability differences 
between twins could be responsible for substantial bias in the 
resulting estimates. 

What should be done about measurement error and related 
problems in models with fixed effects? A possible solu- 
tion to measurement error is to use IV methods. Ashenfel- 
ter and Krueger (1994) use cross-sibling reports to construct 
instruments for schooling differences between twins. For 
example, they use each twin’s report of his brother’s school- 
ing as an instrument for self-reports. A second approach is 
to bring in external information on the extent of measurement
error and adjust naive estimates accordingly. In a study
of union wage effects, Card (1996) uses external information
from a separate validation survey to adjust panel data esti- 
mates for measurement error in reported union status. But 
data from multiple reports and repeated measures of the sort 
used by Ashenfelter and Krueger (1994) and Card (1996) are 
unusual. At a minimum, therefore, it’s important to avoid 
overly strong claims when interpreting fixed effects estimates 
(never bad advice for an applied econometrician in any case). 


5.2 Differences-in-Differences: Pre and Post, 
Treatment and Control 


The fixed effects strategy requires panel data, that is, repeated 
observations on the same individuals (or firms, or whatever the 
unit of observation might be). Often, however, the regressor 
of interest varies only at a more aggregate or group level, such 
as state or cohort. For example, state policies regarding health 
care benefits for pregnant workers may change over time but 
are fixed across workers within states. The source of OVB 
when evaluating these policies must therefore be unobserved 
variables at the state and year level. In some cases, group-level 
omitted variables can be captured by group-level fixed effects, 
an approach that leads to the differences-in-differences (DD) 
identification strategy. 

The DD idea was probably pioneered by physician John 
Snow (1855), who studied cholera epidemics in London in the 
mid-nineteenth century. Snow wanted to establish that cholera 
is transmitted by contaminated drinking water (as opposed to 
“bad air,” the prevailing theory at the time). To show this, 
Snow compared changes in death rates from cholera in dis- 
tricts serviced by two water companies, the Southwark and 
Vauxhall Company and the Lambeth Company. In 1849 both 
companies obtained their water supply from the dirty Thames 
in central London. In 1852, however, the Lambeth Com- 
pany moved its water works upriver to an area relatively free 
of sewage. Death rates in districts supplied by Lambeth fell 
sharply in comparison to the change in death rates in districts 
supplied by Southwark and Vauxhall. 

To make matters more concrete, let us return to an exam- 
ple from economics. Suppose we are interested in the effect 
of the minimum wage on employment, a classic question in 
labor economics. In a competitive labor market, increases in 
the minimum wage move us up a downward-sloping labor 
demand curve. Higher minimums therefore reduce employ- 
ment, perhaps hurting the very workers minimum wage poli- 
cies were designed to help. Card and Krueger (1994) use a 
dramatic change in the New Jersey state minimum wage to see 
if this is true.⁶ 

On April 1, 1992, New Jersey raised the state minimum 
from $4.25 to $5.05. Card and Krueger collected data on 
employment at fast food restaurants in New Jersey in February 
1992 and again in November 1992. These restaurants (Burger 
King, Wendy’s, and so on) are big minimum wage employers. 
Card and Krueger also collected data from the same type of 
restaurants in eastern Pennsylvania, just across the Delaware 
River. The minimum wage in Pennsylvania stayed at $4.25 
throughout this period. They used their data set to compute 
differences-in-differences (DD) estimates of the effects of the 
New Jersey minimum wage increase. That is, they compared 
the February-to-November change in employment in New 
Jersey to the change in employment in Pennsylvania over the 
same period. 

DD is a version of fixed effects estimation using aggregate 
data. To see this, let \(Y_{1ist}\) be fast food employment at restaurant 
i in state s and period t if there is a high state minimum wage, 
and let \(Y_{0ist}\) be fast food employment at restaurant i in state s 
and period t if there is a low state minimum wage. These are 
potential outcomes; in practice, we only get to see one or the 
other. For example, we see \(Y_{1ist}\) in New Jersey in November 
1992. The heart of the DD setup is an additive structure for 
potential outcomes in the no-treatment state. Specifically, we 
assume that 


\[ E[Y_{0ist} \mid s, t] = \gamma_s + \lambda_t, \tag{5.2.1} \]


6 The DD idea was first used to study the effects of minimum wages by 
Obenauer and von der Nienburg (1915), writing for the U.S. Bureau of Labor 
Statistics. 




where s denotes state (New Jersey or Pennsylvania) and t 
denotes period (February, before the minimum wage increase, 
or November, after the increase). This equation says that in 
the absence of a minimum wage change, employment is deter- 
mined by the sum of a time-invariant state effect and a year 
effect that is common across states. The additive state effect 
plays the role of the unobserved individual effect in section 5.1. 

Let \(D_{st}\) be a dummy for high-minimum-wage states and periods. 
Assuming that \(E[Y_{1ist} - Y_{0ist} \mid s, t]\) is a constant, denoted \(\delta\), 
observed employment, \(Y_{ist}\), can be written: 


\[ Y_{ist} = \gamma_s + \lambda_t + \delta D_{st} + \varepsilon_{ist}, \tag{5.2.2} \]

where \(E(\varepsilon_{ist} \mid s, t) = 0\). From here, we get 


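\[ E[Y_{ist} \mid s = \text{PA}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{PA}, t = \text{Feb}] = \lambda_{\text{Nov}} - \lambda_{\text{Feb}} \]


and 


\[ E[Y_{ist} \mid s = \text{NJ}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{NJ}, t = \text{Feb}] = \lambda_{\text{Nov}} - \lambda_{\text{Feb}} + \delta. \]


The population difference-in-differences, 


\[ \{E[Y_{ist} \mid s = \text{NJ}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{NJ}, t = \text{Feb}]\} - \{E[Y_{ist} \mid s = \text{PA}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{PA}, t = \text{Feb}]\} = \delta, \]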


is the causal effect of interest. This is easily estimated using the 
sample analog of the population means. 

Table 5.2.1 (based on table 3 in Card and Krueger, 1994) 
shows average employment at fast food restaurants in New 
Jersey and Pennsylvania before and after the change in the 
New Jersey minimum wage. There are four cells in the first 
two rows and columns, while the margins show state dif- 
ferences in each period, the changes over time in each state, 
and the difference-in-differences. Employment in Pennsylva- 
nia restaurants is somewhat higher than in New Jersey in 
February but falls by November. Employment in New Jer- 
sey, in contrast, increases slightly. These two changes produce 
TABLE 5.2.1 
Average employment in fast food restaurants before and after the 
New Jersey minimum wage increase 


                                  PA       NJ      Difference, NJ - PA 
Variable                          (i)      (ii)           (iii) 
1. FTE employment before,        23.33    20.44           -2.89 
   all available observations    (1.35)    (.51)          (1.44) 
2. FTE employment after,         21.17    21.03            -.14 
   all available observations     (.94)    (.52)          (1.07) 
3. Change in mean FTE            -2.16      .59            2.76 
   employment                    (1.25)    (.54)          (1.36) 


Notes: Adapted from Card and Krueger (1994), table 3. The table reports 
average full-time-equivalent (FTE) employment at restaurants in Pennsylvania 
and New Jersey before and after a minimum wage increase in New Jersey. The 
sample consists of all restaurants with data on employment. Employment at 
six closed restaurants is set to zero. Employment at four temporarily closed 
restaurants is treated as missing. Standard errors are reported in parentheses. 


a positive difference-in-differences, the opposite of what we 
might expect if a higher minimum wage pushed businesses up 
the labor demand curve. 
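
As a quick arithmetic check, the difference-in-differences in table 5.2.1 can be recomputed from the four cell means. The following is a minimal Python sketch; the dictionary layout and variable names are illustrative choices rather than anything from Card and Krueger's analysis, and the inputs are the rounded means printed in the table.

```python
# Rounded cell means from table 5.2.1 (FTE employment per restaurant).
means = {
    ("PA", "Feb"): 23.33,
    ("PA", "Nov"): 21.17,
    ("NJ", "Feb"): 20.44,
    ("NJ", "Nov"): 21.03,
}

# Change over time within each state.
change_pa = means[("PA", "Nov")] - means[("PA", "Feb")]
change_nj = means[("NJ", "Nov")] - means[("NJ", "Feb")]

# Difference-in-differences: the NJ change net of the PA change.
dd = change_nj - change_pa

print(f"PA change: {change_pa:.2f}")
print(f"NJ change: {change_nj:.2f}")
print(f"DD:        {dd:.2f}")
```

With rounded inputs the sketch gives a DD of 2.75, while the table reports 2.76, presumably because the published figure was computed from unrounded means.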

How convincing is this evidence against the standard labor 
demand story? The key identifying assumption here is that 
employment trends would be the same in both states in the 
absence of treatment. Treatment induces a deviation from this 
common trend, as illustrated in figure 5.2.1. Although the 
treatment and control states can differ, this difference is meant 
to be captured by the state fixed effect, which plays the same 
role as the unobserved individual effect in (5.1.3).⁷ 


7 The common trends assumption can be applied to transformed data, for 
example, 


\[ E[\ln Y_{0ist} \mid s, t] = \gamma_s + \lambda_t. \]


Note, however, that common trends in logs rule out common trends in levels 
and vice versa. Athey and Imbens (2006) introduce a semiparametric DD esti- 
mator that allows for common trends after an unspecified transformation of 
the dependent variable. Poterba, Venti, and Wise (1995) and Meyer, Viscusi, 
and Durbin (1995) discuss DD-type models for quantiles. 




[Figure 5.2.1 plots the employment rate against time (before and after treatment), showing the employment trend in the control state, the employment trend in the treatment state, the counterfactual employment trend in the treatment state, and the treatment effect.] 


Figure 5.2.1 Causal effects in the DD model. 


The common trends assumption can be investigated using 
data on multiple periods. In an update of their original 
minimum wage study, Card and Krueger (2000) obtained 
administrative payroll data for restaurants in New Jersey and a 
number of Pennsylvania counties. These data are shown here 
in figure 5.2.2, similar to figure 2 in their follow-up study. 
The vertical lines indicate the dates when the original Card 
and Krueger surveys were conducted, and the third vertical 
line indicates the October 1996 increase in the federal min- 
imum wage to $4.75, which affected Pennsylvania but not 
New Jersey. These data give us an opportunity to look at a 
new minimum wage experiment. 

As in the original Card and Krueger survey, the administra- 
tive data show a slight decline in employment from February 
to November 1992 in Pennsylvania, and little change in New 
Jersey over the same period. However, the data also reveal 
substantial year-to-year employment variation in other peri- 
ods. These swings often seem to differ substantially in the two 
states. In particular, while employment levels in New Jersey 
and Pennsylvania were similar at the end of 1991, employment 
in Pennsylvania fell relative to employment in New Jersey over 
the next three years (especially in the 14-county group), mostly 
before the 1996 increase in the federal minimum wage. So 
Pennsylvania may not provide a very good measure of coun- 
terfactual employment rates in New Jersey in the absence of a 
minimum wage change. 
Figure 5.2.2 Employment in New Jersey and Pennsylvania fast 
food restaurants, October 1991 to September 1997 (from Card and 
Krueger 2000). Vertical lines indicate dates of the original Card and 
Krueger (1994) survey and the October 1996 federal minimum 
wage increase. 


A more encouraging example comes from Pischke (2007), 
who looked at the effect of school term length on student per- 
formance using variation generated by a sharp policy change 
in Germany. Until the 1960s, children in all German states 
except Bavaria started school in the spring. Beginning in the 
1966-67 school year, the spring starters moved to start school 
in the fall. The transition to a fall start required two short 
school years for affected cohorts, 24 weeks long instead of 37. 
Students in these cohorts effectively had their time in school 
compressed relative to cohorts on either side and relative to 
students in Bavaria, which already had a fall start. 

Figure 5.2.3 plots the likelihood of grade repetition for the 
1962-73 cohorts of second graders in Bavaria and affected 
states (there are no repetition data for 1963-65). Repetition 
rates in Bavaria were reasonably flat from 1966 on at around 
2.5 percent. Repetition rates are higher in the short-school- 
year (SSY) states, at around 4-4.5 percent in 1962 and 1966, 
before the change in term length. But repetition rates jump 
up by about a percentage point for the two affected cohorts 
Figure 5.2.3 Average grade repetition rates in second grade for 
treatment and control schools in Germany (from Pischke, 2007). 
The data span a period before and after a change in term length for 
students outside Bavaria (SSY states). 


in these states, a bit more so for the second cohort than for 
the first, before falling back to the baseline level. This graph 
provides strong visual evidence of treatment and control states 
with a common underlying trend, and a treatment effect that 
induces a sharp but transitory deviation from this trend. A 
shorter school year seems to have increased repetition rates 
for affected cohorts. 


5.2.1 Regression DD 


As with the fixed effects model, we can use regression to 
estimate equations like (5.2.2). Let \(\text{NJ}_s\) be a dummy for restaurants 
in New Jersey and \(d_t\) be a time dummy that switches 
on for observations obtained in November (i.e., after the 
minimum wage change). Then 


\[ Y_{ist} = \alpha + \gamma\,\text{NJ}_s + \lambda d_t + \delta(\text{NJ}_s \cdot d_t) + \varepsilon_{ist} \tag{5.2.3} \]


is the same as (5.2.2) where \(\text{NJ}_s \cdot d_t = D_{st}\). In the language 
of section 3.1.4, this model includes two main effects for state 
and year and an interaction term that marks observations from 
New Jersey in November. This is a saturated model, since the 
conditional mean function \(E(Y_{ist} \mid s, t)\) takes on four possible 
values and there are four parameters. The link between the 
parameters in the regression equation, (5.2.3), and those in 
the DD model for the conditional mean function, (5.2.2), is 


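\[
\begin{aligned}
\alpha &= E[Y_{ist} \mid s = \text{PA}, t = \text{Feb}] = \gamma_{\text{PA}} + \lambda_{\text{Feb}} \\
\gamma &= E[Y_{ist} \mid s = \text{NJ}, t = \text{Feb}] - E[Y_{ist} \mid s = \text{PA}, t = \text{Feb}] = \gamma_{\text{NJ}} - \gamma_{\text{PA}} \\
\lambda &= E[Y_{ist} \mid s = \text{PA}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{PA}, t = \text{Feb}] = \lambda_{\text{Nov}} - \lambda_{\text{Feb}} \\
\delta &= \{E[Y_{ist} \mid s = \text{NJ}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{NJ}, t = \text{Feb}]\} \\
&\quad - \{E[Y_{ist} \mid s = \text{PA}, t = \text{Nov}] - E[Y_{ist} \mid s = \text{PA}, t = \text{Feb}]\}.
\end{aligned}
\]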


The regression formulation of the DD model offers a conve- 
nient way to construct DD estimates and standard errors. It’s 
also easy to add additional states or periods to the regression 
setup. We might, for example, add additional control states 
and pretreatment periods to the New Jersey-Pennsylvania sam- 
ple. The resulting generalization of (5.2.3) includes a dummy 
for each state and period but is otherwise unchanged. 
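
The claim that the saturated regression (5.2.3) reproduces the DD of cell means can be checked numerically. The sketch below uses simulated data; the sample size, parameter values, and noise level are hypothetical choices made only for illustration, and NumPy's least-squares routine stands in for whatever regression software one would normally use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
nj = rng.integers(0, 2, size=n)      # 1 = New Jersey, 0 = Pennsylvania
post = rng.integers(0, 2, size=n)    # 1 = November, 0 = February

# Hypothetical parameter values, used only to generate data.
alpha, gamma, lam, delta = 23.0, -3.0, -2.0, 2.5
y = alpha + gamma * nj + lam * post + delta * nj * post + rng.normal(0, 5, size=n)

# Saturated regression: constant, two main effects, and the interaction.
X = np.column_stack([np.ones(n), nj, post, nj * post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def cell_mean(s, t):
    """Sample mean of y in the (state, period) cell."""
    return y[(nj == s) & (post == t)].mean()

# DD computed directly from the four sample cell means.
dd_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

print("interaction coefficient:", coef[3])
print("DD of cell means:       ", dd_means)
```

Because the model is saturated, the interaction coefficient and the DD of cell means agree up to floating-point error, whatever noise is drawn. The regression route has the practical advantage of delivering standard errors at the same time.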

A second advantage of regression DD is that it facilitates 
the study of policies other than those that can be described 
by dummy variables. Instead of New Jersey and Pennsylvania 
in 1992, for example, we might look at all state minimum 
wages in the United States. Some of these are a little higher 
than the federal minimum (which covers everyone regardless 
of where they live), some are a lot higher, and some are the 
same. The minimum wage is therefore a variable with differ- 
ing treatment intensity across states and over time. Moreover, 
in addition to statutory variation in state minima, the local 
importance of a minimum wage varies with average state wage 
levels. For example, the early 1990s federal minimum wage of 
$4.25 an hour was probably irrelevant in Connecticut, with 
high average wages, but a big deal in Mississippi. 

Card (1992) exploits regional variation in the impact of 
the federal minimum wage. His approach is motivated by an 
equation like 

\[ Y_{ist} = \gamma_s + \lambda_t + \delta(FA_s \cdot d_t) + \varepsilon_{ist}, \tag{5.2.4} \]


where the variable \(FA_s\) is a measure of the fraction of teenagers 
likely to be affected by a minimum wage increase in each state 
and \(d_t\) is a dummy for observations in 1990, when the federal 
minimum wage increased from $3.35 to $3.80. The \(FA_s\) variable 
measures the baseline (pre-increase) proportion of each 
state’s teen labor force earning less than $3.80. 

As in the New Jersey-Pennsylvania study, Card (1992) 
works with data from two periods, before and after, in this 
case 1989 and 1990. But this study uses 51 states (includ- 
ing the District of Columbia), for a total of 102 state-year 
observations. Since there are no individual-level covariates in 
(5.2.4), this is the same as estimation with microdata (provided 
the group-level estimates are weighted by cell size). Note that 
\(FA_s \cdot d_t\) is an interaction term, like \(\text{NJ}_s \cdot d_t\) in (5.2.3), though 
here the interaction term takes on a distinct value for each 
observation in the data set. Finally, because Card (1992) analyzes 
data for only two periods, the reported estimates are 
from an equation in first differences: 


\[ \Delta \bar{Y}_s = \lambda^* + \delta FA_s + \Delta \bar{\varepsilon}_s, \]


where \(\Delta \bar{Y}_s\) is the change in average teen employment in state 
s and \(\Delta \bar{\varepsilon}_s\) is the error term in the differenced equation.⁸ 
Table 5.2.2, based on table 3 in Card (1992), shows that 
wages increased more in states where the minimum wage 
increase is likely to have had more bite (see the estimate of .15 
in column 1). This is an important step in Card’s analysis—it 
verifies the notion that the \(FA_s\) (fraction of affected teens) variable 
is a good predictor of the wage changes induced by an 
increase in the federal minimum. Employment, on the other 


8 Other specifications in the spirit of (5.2.4) put a normalized function of 
state and federal minimum wages on the right-hand side instead of \(FA_s \cdot d_t\). 
See, for example, Neumark and Wascher (1992), who work with the differ- 
ence between state and federal minima, adjusted for minimum wage coverage 
provisions, and normalized by state average hourly wage rates. 
TABLE 5.2.2 
Regression DD estimates of minimum wage effects on teens, 
1989 to 1990 


                                Change in           Change in Teen 
                              Mean Log Wage    Employment-Population Ratio 
Explanatory Variable           (1)     (2)          (3)      (4) 
1. Fraction of                 .15     .14          .02     -.01 
   affected teens (FA_s)      (.03)   (.04)        (.03)    (.03) 
2. Change in overall            -      .46           -      1.24 
   emp./pop. ratio                    (.60)                 (.60) 
3. R²                          .30     .31          .01      .09 


Notes: Adapted from Card (1992). The table reports estimates from a regres- 
sion of the change in average teen employment by state on the fraction of teens 
affected by a change in the federal minimum wage in each state. Data are from 
the 1989 and 1990 CPS. Regressions are weighted by the CPS sample size for 
each state. 


hand, seems largely unrelated to FAs, as can be seen in col- 
umn 3. Thus, the results in Card (1992) are in line with the 
results from the New Jersey-Pennsylvania study. 
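
With only two periods, estimating (5.2.4) with a full set of state effects gives the same coefficient on the interaction term as the first-difference regression. The sketch below checks this on simulated state-year averages. All numbers are made up for illustration (only the count of 51 states comes from the text), and the regressions are unweighted, whereas Card (1992) weights by CPS sample size.

```python
import numpy as np

rng = np.random.default_rng(1)
S = 51                                   # states, as in Card (1992)
fa = rng.uniform(0.1, 0.6, size=S)       # hypothetical fraction affected
gamma = rng.normal(size=S)               # hypothetical state effects
delta_true = 0.15                        # assumed value, for the simulation only

# State-by-year averages for two periods (d = 0 before, d = 1 after).
state = np.repeat(np.arange(S), 2)
d = np.tile([0.0, 1.0], S)
y = (gamma[state] + 0.05 * d + delta_true * fa[state] * d
     + rng.normal(0, 0.02, size=2 * S))

# Levels regression: state dummies, period dummy, and the interaction.
state_fx = (state[:, None] == np.arange(S)).astype(float)
X_levels = np.column_stack([state_fx, d, fa[state] * d])
delta_levels = np.linalg.lstsq(X_levels, y, rcond=None)[0][-1]

# First-difference regression: change in y on a constant and FA_s.
dy = y[1::2] - y[0::2]
X_diff = np.column_stack([np.ones(S), fa])
delta_diff = np.linalg.lstsq(X_diff, dy, rcond=None)[0][-1]

print("levels with state effects:", delta_levels)
print("first differences:        ", delta_diff)
```

The two estimates coincide up to floating-point error, which is why reporting the differenced equation loses nothing in the two-period case.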

Card’s (1992) analysis illustrates a further advantage of 
regression DD: it’s easy to add additional covariates in this 
framework. For example, we might like to control for adult 
employment as a source of omitted state-specific trends. In 
other words, we can model counterfactual employment in the 
absence of a change in the minimum wage as 


\[ E[Y_{0ist} \mid s, t, X_{st}] = \gamma_s + \lambda_t + X_{st}'\beta, \]


where \(X_{st}\) is a vector of state- and time-varying covariates, 
including adult employment (though this may not be kosher if 
adult employment also responds to the minimum wage change, 
in which case it’s bad control; see section 3.2.3). As it turns 
out, the addition of an adult employment control has little 
effect on Card’s estimates, as can be seen in columns 2 and 4 
in table 5.2.2. 

It’s worth emphasizing that Card (1992) analyzes state aver- 
ages instead of individual data. He might have used a pooled 
multiyear sample of microdata from the CPS to estimate an 
equation like 

\[ Y_{ist} = \gamma_s + \lambda_t + \delta(FA_s \cdot d_t) + X_{ist}'\beta + \varepsilon_{ist}, \tag{5.2.5} \]


where \(X_{ist}\) can include individual-level characteristics such as 
race as well as time-varying variables measured at the state 
level. Only the latter are likely to be a source of omitted 
variables bias, but individual-level controls can increase pre- 
cision, a point we noted in section 2.3. Inference is a little 
more complicated in a framework that combines microdata 
on dependent variables with group-level regressors, however. 
The key issue is how best to adjust standard errors for possible 
group-level random effects, as we discuss in chapter 8. 

When the sample includes many years, the regression-DD 
model lends itself to a test for causality in the spirit of Granger 
(1969). The Granger idea is to see whether causes happen 
before consequences, and not vice versa (though as we know 
from the epigraph at the beginning of chapter 4, this alone is 
not sufficient for causal inference). Suppose the policy variable 
of interest, \(D_{st}\), changes at different times in different states. 
In this context, Granger causality testing means a check on 
whether, conditional on state and year effects, past \(D_{st}\) predicts 
\(Y_{ist}\) while future \(D_{st}\) does not. If \(D_{st}\) causes \(Y_{ist}\) but not 
vice versa, then dummies for future policy changes should not 
matter in an equation like 


\[ Y_{ist} = \gamma_s + \lambda_t + \sum_{\tau=0}^{m} \delta_{-\tau} D_{s,t-\tau} + \sum_{\tau=1}^{q} \delta_{+\tau} D_{s,t+\tau} + X_{ist}'\beta + \varepsilon_{ist}, \tag{5.2.6} \]


where the sums on the right-hand side allow for m lags (\(\delta_{-1}, 
\delta_{-2}, \ldots, \delta_{-m}\)) or posttreatment effects and q leads (\(\delta_{+1}, \delta_{+2}, \ldots, 
\delta_{+q}\)) or anticipatory effects. The pattern of lagged effects is 
usually of substantive interest as well. We might, for example, 
believe that causal effects should grow or fade as time passes. 
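
To make the lead-and-lag specification (5.2.6) concrete, the following sketch builds the regressors for a small simulated state-year panel and estimates the model by least squares. Everything here is an assumption made for illustration: the panel dimensions, the staggered adoption dates, the effect sizes, and the convention that the policy dummy switches on at adoption and stays on. The outcome is generated without noise so that the recovered coefficients can be compared directly with the values used to generate the data.

```python
import numpy as np

S, T = 20, 15            # hypothetical panel: 20 states, 15 years
m, q = 2, 2              # number of lags and leads
# Adoption year per state; 99 marks states that never adopt.
adopt = np.array([4 + (s % 7) if s < 15 else 99 for s in range(S)])

def D(s, t):
    """Policy dummy: 1 from the adoption year onward."""
    return float(t >= adopt[s])

# Assumed effects: delta_0 = 1.0, delta_{-1} = 0.5, delta_{-2} = 0, no lead effects.
true_lag = {0: 1.0, 1: 0.5, 2: 0.0}

rows, y = [], []
for s in range(S):
    for t in range(m, T - q):          # keep years where all leads and lags are observed
        state_fx = [float(s == j) for j in range(S)]
        year_fx = [float(t == j) for j in range(m + 1, T - q)]   # first year omitted
        lags = [D(s, t - tau) for tau in range(0, m + 1)]
        leads = [D(s, t + tau) for tau in range(1, q + 1)]
        rows.append(state_fx + year_fx + lags + leads)
        y.append(0.3 * s + 0.1 * t
                 + sum(true_lag[tau] * D(s, t - tau) for tau in range(m + 1)))

X = np.array(rows)
y = np.array(y)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

lag_hat = coef[-(m + q + 1):-q]    # estimates of delta_0, delta_{-1}, delta_{-2}
lead_hat = coef[-q:]               # estimates of delta_{+1}, delta_{+2}

print("lags: ", np.round(lag_hat, 6))
print("leads:", np.round(lead_hat, 6))
```

In this noise-free setup the lead coefficients come out at zero and the lag coefficients match the assumed values. With real data, the analogous check is whether the estimated leads are small and statistically indistinguishable from zero.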

Autor (2003) implements the Granger test in an investiga- 
tion of the effect of employment protection on firms’ use of 
temporary help. In the U.S., employment protection is a type 
of labor law—promulgated by state legislatures or, more typ- 
ically, through common law as made by state courts—that 
makes it harder to fire workers. As a rule, U.S. labor law 
allows employment at will, which means that workers can 
be fired for just cause or no cause, at the employer’s whim. 
But some state courts have allowed a number of exceptions to 
the employment-at-will doctrine, leading to lawsuits for unjust 
dismissal. Autor is interested in whether fear of employee law- 
suits makes firms more likely to use temporary workers for 
tasks for which they would otherwise have increased their 
workforce. Temporary workers are employed by someone else 
besides the firm for which they are executing tasks. As a result, 
firms using them cannot be sued for unjust dismissal when they 
let temporary workers go. 

Autor’s empirical strategy relates the employment of tem- 
porary workers in a state to dummy variables indicating state 
court rulings that allow exceptions to the employment-at-will 
doctrine. His regression-DD model includes both leads and 
lags, as in equation (5.2.6). The estimated leads and lags, run- 
ning from two years ahead to four years behind, are plotted in 
figure 5.2.4, a reproduction of figure 3 from Autor (2003). The 
estimates show no effects in the two years before the courts 
adopted an exception, with sharply increasing effects on tem- 
porary employment in the first few years after the adoption, 
which then appear to flatten out with a permanently higher 
rate of temporary employment in affected states. This pattern 
seems consistent with a causal interpretation of Autor’s results. 

An alternative check on the DD identification strategy adds 
state-specific time trends to the list of controls. In other words, 
we estimate 


\[ Y_{ist} = \gamma_{0s} + \gamma_{1s} t + \lambda_t + \delta D_{st} + X_{ist}'\beta + \varepsilon_{ist}, \tag{5.2.7} \]


where \(\gamma_{0s}\) is a state-specific intercept, as before, and \(\gamma_{1s}\) is a 
state-specific trend coefficient multiplying the time trend vari- 
able, t. This allows treatment and control states to follow 
different trends in a limited but potentially revealing way. It’s 
heartening to find that the estimated effects of interest are 
unchanged by the inclusion of these trends, and discouraging 
Figure 5.2.4 The estimated impact of implied-contract exceptions 
to the employment-at-will doctrine on the use of temporary workers 
(from Autor, 2003). The dependent variable is the log of state 
temporary help employment in 1979-1995. Estimates are from a 
model that allows for effects before, during, and after exceptions 
were adopted. 


otherwise. Note, however, that we need at least three periods 
to estimate a model with state-specific trends. Moreover, in 
practice, three periods is typically inadequate to pin down both 
the trends and the treatment effect. As a rule, DD estimation 
with state-specific trends is likely to be more robust and con- 
vincing when the pretreatment data establish a clear trend that 
can be extrapolated into the posttreatment period. 
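
The role of state-specific trends in (5.2.7) can be illustrated with a stylized example in which the policy has no effect at all, but treated states were already on a declining path. The panel below is entirely hypothetical (ten states, ten periods, a common adoption date, no noise); it is a sketch of the mechanism, not a reconstruction of any study.

```python
import numpy as np

S, T = 10, 10
states = np.repeat(np.arange(S), T)
years = np.tile(np.arange(T), S)

treated = states < 5                              # states 0-4 adopt the policy
D = (treated & (years >= 5)).astype(float)        # policy switches on in year 5
trend = np.where(treated, -1.0, 0.0)              # treated states were declining anyway
y = states + trend * years                        # true effect of D is zero

state_fx = (states[:, None] == np.arange(S)).astype(float)
year_fx = (years[:, None] == np.arange(1, T)).astype(float)   # first year omitted
state_trends = state_fx[:, 1:] * years[:, None]               # one trend omitted

def dd_coef(controls):
    """Coefficient on D from a least-squares fit with the given controls."""
    X = np.column_stack([D] + controls)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0]

delta_no_trends = dd_coef([state_fx, year_fx])
delta_with_trends = dd_coef([state_fx, year_fx, state_trends])

print("without state trends:", round(delta_no_trends, 6))
print("with state trends:   ", round(delta_with_trends, 6))
```

Without trends, the DD estimate is -5, which simply reflects the preexisting decline in treated states. Adding state-specific trends returns the correct answer of zero. The example has five pretreatment periods per state, which is what allows the trends to be pinned down separately from the treatment effect.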

In a study of the effects of labor regulation on businesses in 
Indian states, Besley and Burgess (2004) use state trends as a 
robustness check. Different states change regulatory regimes 
at different times, giving rise to a DD research design. As in 
Card (1992), the unit of observation in Besley and Burgess 
(2004) is a state-year average. Table 5.2.3 (based on table IV 
in their paper) reproduces the key results. 

The estimates in column 1, from a regression DD model 
without state-specific trends, suggest that labor regulation 
leads to lower output per capita. The models used to construct 
TABLE 5.2.3 
Estimated effects of labor regulation on the performance of firms 
in Indian states 


                                   (1)      (2)      (3)      (4) 
Labor regulation (lagged)         -.186    -.185    -.104    .0002 
                                  (.064)   (.051)   (.039)   (.020) 
Log development                             .240     .184     .241 
  expenditure per capita                   (.128)   (.119)   (.106) 
Log installed electricity                   .089     .082     .023 
  capacity per capita                      (.061)   (.054)   (.033) 
Log state population                        .720    0.310   -1.419 
                                           (.96)   (1.192)  (2.326) 
Congress majority                                   -.0009    .020 
                                                    (.01)    (.010) 
Hard left majority                                  -.050    -.007 
                                                    (.017)   (.009) 
Janata majority                                      .008    -.020 
                                                    (.026)   (.033) 
Regional majority                                    .006     .026 
                                                    (.009)   (.023) 
State-specific trends              No       No       No       Yes 
Adjusted R²                        .93      .93      .94      .95 


Notes: Adapted from Besley and Burgess (2004), table IV. The table reports 
regression DD estimates of the effects of labor regulation on productivity. The 
dependent variable is log manufacturing output per capita. All models include 
state and year effects. Robust standard errors clustered at the state level are 
reported in parentheses. State amendments to the Industrial Disputes Act are 
coded 1 = pro-worker, 0 = neutral, —1 = pro-employer and then cumulated 
over the period to generate the labor regulation measure. Log of installed 
electrical capacity is measured in kilowatts, and log development expenditure 
is real per capita state spending on social and economic services. Congress, 
hard left, Janata, and regional majority are counts of the number of years 
for which these political groupings held a majority of the seats in the state 
legislatures. The data are for the sixteen main states for the period 1958-92. 
There are 552 observations. 
the estimates in columns 2 and 3 add time-varying state- 
specific covariates, such as government expenditure per capita 
and state population. This is in the spirit of Card’s (1992) 
addition of state-level adult employment rates as a control in 
the minimum wage study. The addition of controls affects the 
Besley and Burgess estimates little. But the addition of state- 
specific trends kills the labor regulation effect, as can be seen 
in column 4. Apparently, labor regulation in India increased 
in states where output was declining anyway. Control for this 
trend therefore drives the estimated regulation effect to zero. 


Picking Controls 


We’ve labeled the two dimensions in the DD setup states and 
time because this is the archetypical DD example in applied 
econometrics. But the DD idea is much more general. Instead 
of states, the subscript s might denote demographic groups, 
some of which are affected by a policy and others are not. 
For example, Kugler, Jimeno, and Hernanz (2005) look at the 
effects of age-specific employment protection policies in Spain. 
Likewise, instead of time, we might group data by cohort or 
other types of characteristics. An example is Angrist and Evans 
(1999), who studied the effect of changes in state abortion 
laws on teen pregnancy using variation by state and year of 
birth. Regardless of the group labels, however, DD designs 
always set up an implicit treatment-control comparison. The 
question of whether this comparison is a good one deserves 
careful consideration. 

One potential pitfall in this context arises when the com- 
position of the treatment and control groups changes as a 
result of treatment. Going back to a design based on state 
and time comparisons, suppose we’re interested in the effects 
of the generosity of public assistance on labor supply. Histori- 
cally, U.S. states have offered widely varying welfare payments 
to poor unmarried mothers. Labor economists have long been 
interested in the effects of such income maintenance policies: 
how much of an increase in living standards they facilitate, 
and whether they make work less attractive (see, e.g., Meyer 
and Rosenbaum, 2001, for a recent study). A concern here, 
emphasized in a review of research on welfare by Moffitt 
(1992), is that poor people who would in any case have weak 
labor force attachment might move to states with more gen- 
erous welfare benefits. In a DD research design, this sort of 
program-induced migration tends to make generous welfare 
programs look worse for labor supply than they really are. 

Migration problems can usually be fixed if we know where 
an individual starts out. Say we know state of residence in 
the period before treatment, or state of birth. State of birth or 
previous state of residence are unchanged by the treatment but 
are still highly correlated with current state of residence. The 
problem of migration is therefore eliminated in comparisons 
using these dimensions instead of state of residence. This intro- 
duces a new problem, however, which is that individuals who 
do move are incorrectly located. In practice, however, this is 
easily addressed with the IV methods discussed in chapter 4 
(state of birth or previous residence can be used to construct 
instruments for current location). 

A modification of the two-by-two DD setup with possibly 
improved control groups uses higher-order contrasts to draw 
causal inferences. An example is the extension of Medicaid 
coverage in the United States, studied by Yelowitz (1995). 
Eligibility for Medicaid, the massive U.S. health insurance 
program for the poor, was once tied to eligibility for Aid to 
Families with Dependent Children (AFDC), a large cash welfare 
program. At various times in the 1980s, however, some 
states extended Medicaid coverage to children in families ineli- 
gible for AFDC. Yelowitz was interested in how this expansion 
of publicly provided health insurance for children affected, 
among other things, mothers’ labor force participation and 
earnings. 

In addition to state and time, children’s age provides a third 
dimension on which Medicaid policy varies. Yelowitz exploits 
this variation by estimating 


\[ Y_{iast} = \gamma_{st} + \lambda_{at} + \theta_{as} + \delta D_{ast} + X_{iast}'\beta + \varepsilon_{iast}, \]


where s indexes states, t indexes time, and a is the age of 
the youngest child in a family. This model provides full 
nonparametric control for state-specific time effects that are 
common across age groups (\(\gamma_{st}\)), time-varying age effects (\(\lambda_{at}\)), 
and state-specific age effects (\(\theta_{as}\)). The regressor of interest, 
\(D_{ast}\), indicates families with children in affected age groups in 
states and periods where Medicaid coverage is provided. This 
triple-differences model may generate a more convincing set of 
results than a traditional DD analysis that exploits differences 
by state and time alone. 
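
The logic of triple differences can be seen in a stripped-down version of this model with two age groups, two states, and two periods. In the sketch below the two-way effects are arbitrary random numbers and the treatment effect is an assumed value; covariates and noise are left out so that the algebra is visible in the output.

```python
import numpy as np

rng = np.random.default_rng(3)
delta = 0.4                                  # assumed treatment effect

# Arbitrary two-way effects; each index is 0 or 1.
g_st = rng.normal(size=(2, 2))               # state-by-time effects
l_at = rng.normal(size=(2, 2))               # age-by-time effects
th_as = rng.normal(size=(2, 2))              # age-by-state effects

# Cell means y[a, s, t]; only age group 1 in state 1 in period 1 is treated.
y = np.zeros((2, 2, 2))
for a in range(2):
    for s in range(2):
        for t in range(2):
            treated = (a == 1 and s == 1 and t == 1)
            y[a, s, t] = g_st[s, t] + l_at[a, t] + th_as[a, s] + delta * treated

def dd(a):
    """State-by-time DD within age group a."""
    return (y[a, 1, 1] - y[a, 1, 0]) - (y[a, 0, 1] - y[a, 0, 0])

ddd = dd(1) - dd(0)

print("DD, affected age group:  ", dd(1))
print("DD, unaffected age group:", dd(0))
print("triple difference:       ", ddd)
```

The simple state-by-time DD for the affected age group mixes the treatment effect with differential state trends (the state-by-time effects). Subtracting the same DD for the unaffected age group removes them and returns the assumed effect exactly.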


5.3 Fixed Effects versus Lagged 
Dependent Variables 


Fixed effects and DD estimators are based on the presump- 
tion of time-invariant (or group-invariant) omitted variables. 
Suppose, for example, we are interested in the effects of par- 
ticipation in a subsidized training program, as in the Dehejia 
and Wahba (1999) and Lalonde (1986) studies discussed in 
section 3.3.3. The key identifying assumption motivating fixed 
effects estimation in this case is 


\[ E[Y_{0it} \mid \alpha_i, X_{it}, D_{it}] = E[Y_{0it} \mid \alpha_i, X_{it}], \tag{5.3.1} \]


where \(\alpha_i\) is an unobserved personal characteristic that determines, 
along with covariates, \(X_{it}\), whether individual i gets 
training. To be concrete, \(\alpha_i\) might be a measure of vocational 
skills, though a strike against the fixed effects setup is the 
fact that the exact nature of the unobserved variables typically 
remains somewhat mysterious. In any case, coupled with 
a linear model for \(E(Y_{0it} \mid \alpha_i, X_{it})\), assumption (5.3.1) leads to 
simple estimation strategies involving differences or deviations 
from means. 

For many causal questions, the notion that the most impor- 
tant omitted variables are time invariant doesn’t seem plau- 
sible. The evaluation of training programs is a case in point. 
It’s likely that people looking to improve their labor market 
options by participating in a government-sponsored training 
program have suffered some kind of setback. Many training 
programs explicitly target people who have suffered a recent 
setback, such as men who recently lost their jobs. Consis- 
tent with this, Ashenfelter (1978) and Ashenfelter and Card 
(1985) find that training participants typically have earnings 
histories that exhibit a preprogram dip. Past earnings is a time-varying 
confounding variable that cannot be subsumed in a 
time-invariant omitted variable like \(\alpha_i\). 

The distinctive earnings histories of trainees motivates an 
estimation strategy that controls for past earnings directly 
and dispenses with fixed effects. To be precise, instead of 
(5.3.1), we might base causal inference on the conditional 
independence assumption, 


\[ E[Y_{0it} \mid Y_{it-h}, X_{it}, D_{it}] = E[Y_{0it} \mid Y_{it-h}, X_{it}]. \tag{5.3.2} \]


This is like saying that what makes trainees special is their 
earnings h periods ago. We can then use panel data to estimate 


\[ Y_{it} = \alpha + \theta Y_{it-h} + \lambda_t + \delta D_{it} + X_{it}'\beta + \varepsilon_{it}, \tag{5.3.3} \]


where the causal effect of training is \(\delta\). To make this more 
general, \(Y_{it-h}\) can be a vector including lagged earnings for 
multiple periods.⁹ 

Applied researchers using panel data are often faced with the 
challenge of choosing between fixed effects and lagged depen- 
dent variables models, that is, between causal inferences 
based on (5.3.1) and (5.3.2). One solution to this dilemma 
is to work with a model that includes both lagged dependent 
variables and unobserved individual effects. In other words, 
identification might be based on 


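$$E[Y_{0it} \mid \alpha_i, Y_{it-h}, X_{it}, D_{it}] = E[Y_{0it} \mid \alpha_i, Y_{it-h}, X_{it}], \tag{5.3.4}$$

which requires conditioning on both $\alpha_i$ and $Y_{it-h}$. We can
then try to estimate causal effects using a specification like

$$Y_{it} = \alpha_i + \theta Y_{it-h} + \lambda_t + \delta D_{it} + X_{it}'\beta + \varepsilon_{it}. \tag{5.3.5}$$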


⁹ Abadie, Diamond, and Hainmueller (2007) develop a semiparametric
version of the lagged dependent variables model, more flexible than the
traditional regression setup. As in (5.3.2), the key assumption in this model
is independence of treatment status and potential outcomes conditional on
lagged earnings. The Abadie, Diamond, and Hainmueller approach works for
microdata and for data with a group structure. The Dehejia and Wahba (1999)
matching strategy also uses lagged dependent variables.

Unfortunately, the conditions for consistent estimation of
$\delta$ in equation (5.3.5) are much more demanding than those
required with fixed effects or lagged dependent variables alone.
This can be seen in a simple example where the lagged dependent
variable is $Y_{it-1}$. We kill the fixed effect by differencing,
which produces

$$\Delta Y_{it} = \theta \Delta Y_{it-1} + \Delta\lambda_t + \delta \Delta D_{it} + \Delta X_{it}'\beta + \Delta\varepsilon_{it}. \tag{5.3.6}$$


The problem here is that the differenced residual, $\Delta\varepsilon_{it}$,
is necessarily correlated with the lagged dependent variable,
$\Delta Y_{it-1}$, because both are a function of $\varepsilon_{it-1}$.
Consequently, OLS estimates of (5.3.6) are not consistent for the
parameters in (5.3.5), a problem first noted by Nickell (1981). This
problem can be solved, though the solution requires strong assumptions.
The easiest solution is to use $Y_{it-2}$ as an instrument for
$\Delta Y_{it-1}$ in (5.3.6).¹⁰ But this requires that $Y_{it-2}$ be
uncorrelated with the differenced residuals, $\Delta\varepsilon_{it}$.
This seems unlikely, since residuals are the part of earnings left over
after accounting for covariates. Most people’s earnings are highly
correlated from one year to the next, so that past earnings are also
likely to be correlated with $\Delta\varepsilon_{it}$. If
$\varepsilon_{it}$ is serially correlated, there may be no consistent
estimator for (5.3.6). (Note also that the IV strategy using $Y_{it-2}$
as an instrument requires at least three
periods, so we get data for $t$, $t-1$, and $t-2$.)
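
The inconsistency of OLS in (5.3.6), and the way the $Y_{it-2}$ instrument repairs it, can be seen in a small simulation. The sketch below is only an illustration: the parameter values, the normal errors, and the rule tying $D_{it}$ to $\alpha_i$ are all assumptions, and the errors are serially uncorrelated by construction, which is precisely the condition the text warns may fail in real earnings data.

```python
from collections import deque

import numpy as np


def simulate_panel(n=200_000, theta=0.5, delta=1.0, burn_in=50, seed=0):
    """Simulate Y_it = alpha_i + theta*Y_it-1 + delta*D_it + eps_it (assumed DGP)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n)              # unobserved individual effect
    y = alpha / (1.0 - theta)               # rough starting value before burn-in
    last_three = deque(maxlen=3)            # keep only periods t-2, t-1, t
    for _ in range(burn_in + 3):
        # Treatment depends on alpha_i plus independent noise (an assumption).
        d = (alpha + rng.normal(size=n) > 0).astype(float)
        y = alpha + theta * y + delta * d + rng.normal(size=n)
        last_three.append((y, d))
    (y_tm2, _), (y_tm1, d_tm1), (y_t, d_t) = last_three
    return y_t, y_tm1, y_tm2, d_t, d_tm1


def differenced_estimates(y_t, y_tm1, y_tm2, d_t, d_tm1):
    """Return (OLS, IV) coefficient vectors for the differenced equation."""
    dy = y_t - y_tm1
    dy_lag = y_tm1 - y_tm2
    dd = d_t - d_tm1
    const = np.ones_like(dy)
    X = np.column_stack([dy_lag, dd, const])     # regressors
    Z = np.column_stack([y_tm2, dd, const])      # Y_it-2 instruments for dy_lag
    ols = np.linalg.lstsq(X, dy, rcond=None)[0]
    iv = np.linalg.solve(Z.T @ X, Z.T @ dy)      # just-identified IV
    return ols, iv


ols, iv = differenced_estimates(*simulate_panel())
print(f"OLS on differences: theta = {ols[0]:.3f}, delta = {ols[1]:.3f}")
print(f"IV using Y_it-2:    theta = {iv[0]:.3f}, delta = {iv[1]:.3f}")
```

Under these assumptions the OLS estimate of $\theta$ should fall well below the value of 0.5 used to generate the data, while the IV estimates should land close to the assumed $\theta$ and $\delta$. Making the simulated errors serially correlated would break the IV estimates as well.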

Given the difficulties that arise when trying to estimate 
(5.3.6), we might ask whether the distinction between fixed 
effects and lagged dependent variables matters. The answer, 
unfortunately, is yes. The fixed effects and lagged dependent 
variables models are not nested, which means we cannot hope 
to estimate one and get the other as a special case if need be. 

So what’s an applied guy to do? One answer, as always, 
is to check the robustness of your findings using alternative 
identifying assumptions. That means that you would like to 
find broadly similar results using plausible alternative mod- 
els. Fixed effects and lagged dependent variables estimates 
also have a useful bracketing property. The appendix to this
chapter shows that if (5.3.2) is correct, but you mistakenly
use fixed effects, estimates of a positive treatment effect will
tend to be too big. On the other hand, if (5.3.1) is correct and
you mistakenly estimate an equation with lagged outcomes,
such as (5.3.3), estimates of a positive treatment effect will
tend to be too small. You can therefore think of fixed effects
and lagged dependent variables as bounding the causal effect
of interest (given some assumptions about the nature of selection
bias). Guryan (2004) illustrates this sort of reasoning in a
study estimating the effects of court-ordered busing on black
students’ high school graduation rates.

¹⁰ See Holtz-Eakin, Newey, and Rosen (1988), Arellano and Bond (1991),
and Blundell and Bond (1998) for details and examples.


5.4 Appendix: More on Fixed Effects 
and Lagged Dependent Variables 


To simplify, we ignore covariates, intercepts, and year effects
and assume there are only two periods, with treatment equal
to zero for everyone in the first period (the punch line is the
same in a more general setup). The causal effect of interest,
$\delta$, is positive. Suppose first that treatment (training status)
is correlated with an unobserved individual effect, $\alpha_i$, but
uncorrelated with lagged outcome residuals, $\varepsilon_{it-1}$, and
that outcomes can be described by

$$Y_{it} = \alpha_i + \delta D_{it} + \varepsilon_{it}, \tag{5.4.1}$$

where $\varepsilon_{it}$ is serially uncorrelated, and also
uncorrelated with $\alpha_i$ and $D_{it}$. We also have

$$Y_{it-1} = \alpha_i + \varepsilon_{it-1},$$

where $\alpha_i$ and $\varepsilon_{it-1}$ are uncorrelated. You
mistakenly estimate the effect of $D_{it}$ in a model that controls for
$Y_{it-1}$ but ignores fixed effects. The resulting estimator has
probability limit $Cov(Y_{it}, \tilde{D}_{it}) / V(\tilde{D}_{it})$,
where $\tilde{D}_{it} = D_{it} - \gamma Y_{it-1}$ is the residual from a
regression of $D_{it}$ on $Y_{it-1}$.

Now substitute $\alpha_i = Y_{it-1} - \varepsilon_{it-1}$ in (5.4.1) to get

$$Y_{it} = Y_{it-1} + \delta D_{it} + \varepsilon_{it} - \varepsilon_{it-1}.$$

From here, we get

$$\frac{Cov(Y_{it}, \tilde{D}_{it})}{V(\tilde{D}_{it})}
= \delta - \frac{Cov(\varepsilon_{it-1}, \tilde{D}_{it})}{V(\tilde{D}_{it})}
= \delta - \frac{Cov(\varepsilon_{it-1}, D_{it} - \gamma Y_{it-1})}{V(\tilde{D}_{it})}
= \delta + \frac{\gamma \sigma_\varepsilon^2}{V(\tilde{D}_{it})},$$

where $\sigma_\varepsilon^2$ is the variance of $\varepsilon_{it-1}$.
Since trainees have low $Y_{it-1}$, $\gamma < 0$, and the resulting
estimate of $\delta$ is too small.

Suppose instead that treatment is determined by low $Y_{it-1}$.
Causal effects can be estimated using a simplified version of
(5.3.3), say

$$Y_{it} = \alpha + \theta Y_{it-1} + \delta D_{it} + \varepsilon_{it}, \tag{5.4.2}$$


where $\varepsilon_{it}$ is serially uncorrelated and uncorrelated with
$D_{it}$. You mistakenly estimate a first-differenced equation in an
effort to kill fixed effects. This ignores the lagged dependent
variable. In this simple example, where $D_{it-1} = 0$ for everyone,
the first-differenced estimator has probability limit

$$\frac{Cov(Y_{it} - Y_{it-1}, D_{it} - D_{it-1})}{V(D_{it} - D_{it-1})}
= \frac{Cov(Y_{it} - Y_{it-1}, D_{it})}{V(D_{it})}. \tag{5.4.3}$$

Subtracting $Y_{it-1}$ from both sides of (5.4.2), we have

$$Y_{it} - Y_{it-1} = \alpha + (\theta - 1) Y_{it-1} + \delta D_{it} + \varepsilon_{it}.$$

Substituting this in (5.4.3), the inappropriately differenced
model yields

$$\frac{Cov(Y_{it} - Y_{it-1}, D_{it})}{V(D_{it})}
= \delta + (\theta - 1)\left[\frac{Cov(Y_{it-1}, D_{it})}{V(D_{it})}\right].$$

In general, we think $\theta$ is a positive number less than one,
otherwise $Y_{it}$ is nonstationary (i.e., an explosive time series
process). Therefore, since trainees have low $Y_{it-1}$, the estimate of
$\delta$ in first differences is too big. Note that in this simple model,
differencing turns out to be ok in the unlikely event $\theta = 1$ in
(5.4.2), but that is not true in general.
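
The bracketing argument in this appendix can be checked numerically. The following sketch simulates the two stylized worlds described above, one in which (5.4.1) holds and one in which (5.4.2) holds, and applies both estimators in each. The distributions, the selection rules, and the values $\delta = 1$ and $\theta = 0.6$ are assumptions chosen only for illustration.

```python
import numpy as np


def coef_on_d(y, d, *controls):
    """OLS coefficient on d in a regression of y on a constant, d, and controls."""
    X = np.column_stack([np.ones_like(y), d, *controls])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


def both_estimators(y_t, y_lag, d):
    fd = coef_on_d(y_t - y_lag, d)      # first differences (D_it-1 = 0 for everyone)
    ldv = coef_on_d(y_t, d, y_lag)      # control for the lagged outcome instead
    return fd, ldv


rng = np.random.default_rng(0)
n, delta, theta = 200_000, 1.0, 0.6

# World 1: (5.4.1) is true. Selection is on the fixed effect alpha_i.
alpha = rng.normal(size=n)
d = (alpha + rng.normal(size=n) < 0).astype(float)    # low-alpha people train
y_lag = alpha + rng.normal(size=n)
y_t = alpha + delta * d + rng.normal(size=n)
fd_fe_world, ldv_fe_world = both_estimators(y_t, y_lag, d)

# World 2: (5.4.2) is true. Selection is on low lagged earnings.
y_lag = rng.normal(size=n)
d = (y_lag + rng.normal(size=n) < 0).astype(float)
y_t = 0.5 + theta * y_lag + delta * d + rng.normal(size=n)
fd_ldv_world, ldv_ldv_world = both_estimators(y_t, y_lag, d)

print(f"(5.4.1) true: FD = {fd_fe_world:.2f}, lagged-Y = {ldv_fe_world:.2f}")
print(f"(5.4.2) true: FD = {fd_ldv_world:.2f}, lagged-Y = {ldv_ldv_world:.2f}")
```

In each world the correctly specified estimator should come out near the assumed $\delta$ of 1, the misapplied lagged-outcome estimator should come out below it, and the misapplied first-difference estimator above it, which is the bracketing pattern derived above.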
Chapter 6


Getting a Little Jumpy: Regression
Discontinuity Designs


But when you start exercising those rules, all sorts of 
processes start to happen and you start to find out all sorts 
of stuff about people. ... It’s just a way of thinking about a 
problem, which lets the shape of the problem begin to 
emerge. The more rules, the tinier the rules, the more 
arbitrary they are, the better. 

Douglas Adams, Mostly Harmless 


Regression discontinuity (RD) research designs exploit
precise knowledge of the rules determining treatment.
RD identification is based on the idea that in a highly
rule-based world, some rules are arbitrary and therefore
provide good experiments. RD comes in two styles, fuzzy
and sharp. The sharp design can be seen as a selection-on-
observables story. The fuzzy design leads to an instrumental
variables (IV) type of setup.


6.1 Sharp RD 


Sharp RD is used when treatment status is a deterministic
and discontinuous function of a covariate, $x_i$. Suppose, for
example, that

$$D_i = \begin{cases} 1 & \text{if } x_i \ge x_0 \\ 0 & \text{if } x_i < x_0, \end{cases} \tag{6.1.1}$$

where $x_0$ is a known threshold or cutoff. This assignment
mechanism is a deterministic function of $x_i$ because once we
know $x_i$ we know $D_i$. Treatment is a discontinuous function
of $x_i$ because no matter how close $x_i$ gets to $x_0$, treatment is
unchanged until $x_i = x_0$.

This may seem a little abstract, so here is an example. 
American high school students are awarded National Merit
Scholarships on the basis of scores on the PSAT, a test taken by most
college-bound high school juniors, especially those who will
later take the SAT. The question that motivated one of the first 
discussions of RD is whether students who win National Merit 
Scholarships change their career or study plans as a result. 
For example, National Merit Scholars may be more likely to 
go to graduate school (Thistlethwaite and Campbell, 1960;
Campbell, 1969). Sharp RD compares the graduate school 
attendance rates of students with PSAT scores just above and 
just below the National Merit Award thresholds. In general, 
we might expect students with higher PSAT scores to be more 
likely to go to graduate school, but this effect can be controlled 
by fitting a regression to the relationship between graduate 
school attendance rates and PSAT scores, at least in the neigh- 
borhood of the award cutoff. In this example, jumps in the 
relationship between PSAT scores and graduate school atten- 
dance in the neighborhood of the award threshold are taken 
as evidence of a treatment effect. It is this jump in regression
lines that gives RD its name.¹

An interesting and important feature of RD, highlighted
in a recent survey by Imbens and Lemieux (2008), is that
there is no value of $x_i$ at which we get to observe both treatment
and control observations. Unlike full covariate matching
strategies, which are based on treatment-control comparisons
conditional on covariate values where there is some overlap,
the validity of RD turns on our willingness to extrapolate
across covariate values, at least in a neighborhood of the
discontinuity. This is one reason why sharp RD is usually seen
as distinct from other control strategies. For this same reason,
we typically cannot afford to be as agnostic about regression
functional form in the RD world as in the world of chapter 3.

¹ The basic structure of RD designs appears to have emerged simultaneously
in a number of disciplines but has only recently become important in applied
econometrics. Cook (2008) gives an intellectual history. In an analysis using
Lalonde (1986)-style within-study comparisons, Cook and Wong (2008) find
that RD generally does a good job of reproducing the results from randomized
trials.

Figure 6.1.1 illustrates a hypothetical RD scenario where
those with $x_i \ge 0.5$ are treated. In panel A, the trend
relationship between outcomes and $x_i$ is linear, while in panel B it’s
nonlinear. In both cases there is a discontinuity in the observed
CEF, $E[Y_i \mid x_i]$, around the point $x_0$, while
$E[Y_{0i} \mid x_i]$ is smooth.

A simple model formalizes the RD idea. Suppose that
in addition to the assignment mechanism, (6.1.1), potential
outcomes can be described by a linear, constant effects model

$$\begin{aligned}
E[Y_{0i} \mid x_i] &= \alpha + \beta x_i \\
Y_{1i} &= Y_{0i} + \rho.
\end{aligned}$$

This leads to the regression,

$$Y_i = \alpha + \beta x_i + \rho D_i + \eta_i, \tag{6.1.2}$$

where $\rho$ is the causal effect of interest. The key difference
between this regression and others we’ve used to estimate treatment
effects (e.g., in chapter 3) is that $D_i$, the regressor of
interest, not only is correlated with $x_i$, it is a deterministic
function of $x_i$. RD captures causal effects by distinguishing
the nonlinear and discontinuous function, $1(x_i \ge x_0)$, from
the smooth and (in this case) linear function, $x_i$.

But what if the trend relation, $E[Y_{0i} \mid x_i]$, is nonlinear? To
be precise, suppose that $E[Y_{0i} \mid x_i] = f(x_i)$ for some reasonably
smooth function, $f(x_i)$. Panel B in figure 6.1.1 suggests there is
still hope even in this more general case. Now we can construct
RD estimates by fitting

$$Y_i = f(x_i) + \rho D_i + \eta_i, \tag{6.1.3}$$

where again, $D_i = 1(x_i \ge x_0)$ is discontinuous in $x_i$ at $x_0$. As
long as $f(x_i)$ is continuous in a neighborhood of $x_0$, it should
be possible to estimate a model like (6.1.3), even with a flexible
functional form for $f(x_i)$. For example, modeling $f(x_i)$ with a
pth-order polynomial, RD estimates can be constructed from
the regression

$$Y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_p x_i^p + \rho D_i + \eta_i. \tag{6.1.4}$$

Figure 6.1.1 The sharp regression discontinuity design. (A) Linear
$E[Y_{0i} \mid x_i]$. (B) Nonlinear $E[Y_{0i} \mid x_i]$. (C) Nonlinearity
mistaken for discontinuity. The vertical axis in each panel is the outcome.
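A generalization of RD based on (6.1.4) allows different
trend functions for $E[Y_{0i} \mid x_i]$ and $E[Y_{1i} \mid x_i]$.
Modeling both of these CEFs with pth-order polynomials, we have

$$\begin{aligned}
E[Y_{0i} \mid x_i] &= f_0(x_i) = \alpha + \beta_{01}\tilde{x}_i + \beta_{02}\tilde{x}_i^2 + \cdots + \beta_{0p}\tilde{x}_i^p \\
E[Y_{1i} \mid x_i] &= f_1(x_i) = \alpha + \rho + \beta_{11}\tilde{x}_i + \beta_{12}\tilde{x}_i^2 + \cdots + \beta_{1p}\tilde{x}_i^p,
\end{aligned}$$

where $\tilde{x}_i = x_i - x_0$. Centering $x_i$ at $x_0$ is a
normalization that ensures that the treatment effect at $x_i = x_0$ is
the coefficient on $D_i$ in a regression model with interaction terms.

To derive a regression model that can be used to estimate
the causal effect of interest in this case, we use the fact that $D_i$
is a deterministic function of $x_i$ to write

$$E[Y_i \mid x_i] = E[Y_{0i} \mid x_i] + (E[Y_{1i} \mid x_i] - E[Y_{0i} \mid x_i]) D_i. \tag{6.1.5}$$

Substituting polynomials for conditional expectations, we
then have

$$\begin{aligned}
Y_i = {} & \alpha + \beta_{01}\tilde{x}_i + \beta_{02}\tilde{x}_i^2 + \cdots + \beta_{0p}\tilde{x}_i^p \\
& + \rho D_i + \beta_1^* D_i \tilde{x}_i + \beta_2^* D_i \tilde{x}_i^2 + \cdots + \beta_p^* D_i \tilde{x}_i^p + \eta_i,
\end{aligned} \tag{6.1.6}$$

where $\beta_1^* = \beta_{11} - \beta_{01}$, $\beta_2^* = \beta_{12} - \beta_{02}$,
and $\beta_p^* = \beta_{1p} - \beta_{0p}$, and $\eta_i$ is the residual.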

Equation (6.1.4) is a special case of (6.1.6) where
$\beta_1^* = \beta_2^* = \cdots = \beta_p^* = 0$. In the more general
model, the treatment effect at $x_i - x_0 = c > 0$ is
$\rho + \beta_1^* c + \beta_2^* c^2 + \cdots + \beta_p^* c^p$, while the
treatment effect at $x_0$ is $\rho$. The model with interactions has the
attraction that it imposes no restrictions on the underlying
conditional mean functions. But in our experience, RD estimates
of $\rho$ based on the simpler model, (6.1.4), usually turn
out to be similar to those based on (6.1.6). This is not surprising,
since either way the estimated $\rho$ is mostly driven by
variability in $E[Y_i \mid x_i]$ in the neighborhood of $x_0$.
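
As a concrete illustration of (6.1.4) and (6.1.6), the sketch below fits both specifications by OLS on simulated data. The function name `sharp_rd`, the cubic form of $E[Y_{0i} \mid x_i]$, the cutoff of 0.5, and the effect size of 0.4 are all assumptions made for the example, not anything taken from an empirical study.

```python
import numpy as np


def sharp_rd(y, x, x0, p=3, interactions=False):
    """Estimate rho from (6.1.4), or from (6.1.6) when interactions=True.

    The running variable is centered at x0 in both cases. Without
    interaction terms, centering leaves the coefficient on D_i unchanged.
    """
    xt = x - x0
    d = (x >= x0).astype(float)
    powers = [xt ** j for j in range(1, p + 1)]
    cols = [np.ones_like(y)] + powers + [d]
    if interactions:
        cols += [d * term for term in powers]    # D_i times each polynomial term
    coefs = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0]
    return coefs[p + 1]                          # coefficient on D_i


# Simulated data (hypothetical): a smooth cubic E[Y0|x] plus a jump of rho at x0.
rng = np.random.default_rng(1)
n, x0, rho = 20_000, 0.5, 0.4
x = rng.uniform(0, 1, size=n)
xt = x - x0
y0 = 0.5 + 0.8 * xt - 1.2 * xt ** 2 + 2.0 * xt ** 3
y = y0 + rho * (x >= x0) + rng.normal(scale=0.1, size=n)

rho_simple = sharp_rd(y, x, x0, p=3)
rho_interacted = sharp_rd(y, x, x0, p=3, interactions=True)
print(f"(6.1.4): {rho_simple:.3f}   (6.1.6): {rho_interacted:.3f}")
```

Because the simulated trend is an exact cubic, both specifications should recover a jump close to the assumed 0.4; the interacted model simply pays for its flexibility with a somewhat noisier estimate.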

The validity of RD estimates of causal effects based on
(6.1.4) or (6.1.6) turns on whether polynomial models provide
an adequate description of $E[Y_{0i} \mid x_i]$. If not, then what looks
like a jump due to treatment might simply be an unaccounted-for
nonlinearity in the counterfactual conditional mean
function. This possibility is illustrated in panel C of figure
6.1.1, which shows how a sharp turn in $E[Y_{0i} \mid x_i]$ might be
mistaken for a jump from one regression line to another. To reduce
the likelihood of such mistakes, we can look only at data in a
neighborhood around the discontinuity, say the interval
$[x_0 - \Delta, x_0 + \Delta]$ for some small positive number $\Delta$.
Then we have

$$\begin{aligned}
E[Y_i \mid x_0 - \Delta < x_i < x_0] &\simeq E[Y_{0i} \mid x_i = x_0] \\
E[Y_i \mid x_0 \le x_i < x_0 + \Delta] &\simeq E[Y_{1i} \mid x_i = x_0],
\end{aligned}$$

so that

$$\lim_{\Delta \to 0} E[Y_i \mid x_0 \le x_i < x_0 + \Delta] - E[Y_i \mid x_0 - \Delta < x_i < x_0]
= E[Y_{1i} - Y_{0i} \mid x_i = x_0]. \tag{6.1.7}$$

In other words, comparisons of average outcomes in a small
enough neighborhood to the left and right of $x_0$ estimate the
treatment effect in a way that does not depend on the correct
specification of a model for $E[Y_{0i} \mid x_i]$. Moreover, the validity
of this nonparametric estimation strategy does not turn on
the constant effects assumption, $Y_{1i} - Y_{0i} = \rho$; the estimand in
(6.1.7) is the average causal effect, $E[Y_{1i} - Y_{0i} \mid x_i = x_0]$.

The nonparametric approach to RD requires good estimates
of the mean of $Y_i$ in small neighborhoods to the right and left
of $x_0$. Obtaining such estimates is tricky. The first problem is
that working in a small neighborhood of the cutoff means you
don’t have much data. Also, the sample average is biased for
the CEF in the neighborhood of a boundary (in this case, $x_0$).
Solutions to these problems include the use of a nonparamet- 
ric version of regression called local linear regression (Hahn, 
Todd, and van der Klaauw, 2001) and the partial linear and 
local polynomial regression estimators developed by Porter 
(2003). Local regression amounts to weighted least squares 
(WLS) estimation of an equation like (6.1.6), with more weight 
given to points close to the cutoff. 
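
The description of local regression as WLS can be made concrete with a short sketch. The version below uses a triangular kernel and a bandwidth `h` picked by hand, both of which are assumptions for illustration; it is not the data-driven procedure of Hahn, Todd, and van der Klaauw (2001) or Porter (2003). The simulated trend, cutoff, and jump of 0.4 are likewise hypothetical.

```python
import numpy as np


def local_linear_rd(y, x, x0, h):
    """WLS of y on (1, x~, D, D*x~) with triangular-kernel weights.

    Returns the estimated jump at x0. Observations farther than h from
    the cutoff get zero weight and are dropped.
    """
    xt = x - x0
    w = np.clip(1.0 - np.abs(xt) / h, 0.0, None)   # weight falls linearly to 0 at |x~| = h
    keep = w > 0
    xt, yk, w = xt[keep], y[keep], w[keep]
    d = (xt >= 0).astype(float)
    X = np.column_stack([np.ones_like(xt), xt, d, d * xt])
    sw = np.sqrt(w)                                # WLS = OLS on sqrt-weighted data
    coefs = np.linalg.lstsq(X * sw[:, None], yk * sw, rcond=None)[0]
    return coefs[2]


rng = np.random.default_rng(4)
n, x0, rho, h = 50_000, 0.5, 0.4, 0.1
x = rng.uniform(0, 1, size=n)
y = 2.0 * x + np.sin(3.0 * x) + rho * (x >= x0) + rng.normal(scale=0.2, size=n)

# Naive comparison of sample means on either side of the cutoff, same window.
in_window = np.abs(x - x0) < h
naive = y[in_window & (x >= x0)].mean() - y[in_window & (x < x0)].mean()
local_linear = local_linear_rd(y, x, x0, h)
print(f"difference in means: {naive:.3f}   local linear: {local_linear:.3f}")
```

With an upward-sloping trend like the one assumed here, the simple difference in means within the window should overstate the jump, while the local linear estimate should sit close to the assumed 0.4. This is the boundary bias the text refers to.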

Sophisticated nonparametric RD methods have not yet
found wide application in empirical practice; most applied RD
work is still parametric. But the idea of focusing on observations
near the cutoff value, what Angrist and Lavy (1999)
call a “discontinuity sample,” suggests a valuable robustness
check: Although RD estimates become less precise as the window
used to select a discontinuity sample gets smaller, the
number of polynomial terms needed to model $f(x_i)$ should go
down. Hopefully, as you zero in on $x_0$ with fewer and fewer
controls, the estimated effect of $D_i$ remains stable.² A second
important check looks at the behavior of pretreatment variables
near the discontinuity. Since pretreatment variables are
unaffected by treatment, there should be no jump in the CEF
of these variables at $x_0$.
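
The discontinuity-sample check can be sketched as a loop over progressively narrower windows with progressively fewer polynomial terms. The window widths, polynomial orders, and simulated data below are assumptions chosen for illustration; in an application these choices would be driven by the data at hand.

```python
import numpy as np


def rd_in_window(y, x, x0, half_width, p):
    """Fit (6.1.4) with a pth-order polynomial using only |x - x0| <= half_width.

    Returns the coefficient on D_i and the number of observations used.
    """
    keep = np.abs(x - x0) <= half_width
    xt, yk = x[keep] - x0, y[keep]
    d = (xt >= 0).astype(float)
    cols = [np.ones_like(xt)] + [xt ** j for j in range(1, p + 1)] + [d]
    coefs = np.linalg.lstsq(np.column_stack(cols), yk, rcond=None)[0]
    return coefs[-1], int(keep.sum())


rng = np.random.default_rng(5)
n, x0, rho = 100_000, 0.5, 0.4
x = rng.uniform(0, 1, size=n)
xt = x - x0
y = (0.5 + 0.8 * xt - 1.2 * xt ** 2 + 2.0 * xt ** 3
     + rho * (x >= x0) + rng.normal(scale=0.2, size=n))

# (half-width of window, polynomial order): narrower windows, fewer controls.
specs = [(0.5, 3), (0.2, 2), (0.1, 1), (0.05, 1)]
estimates = {}
for half_width, p in specs:
    est, n_used = rd_in_window(y, x, x0, half_width, p)
    estimates[(half_width, p)] = est
    print(f"window +/-{half_width:<5} order {p}: rho = {est:.3f} (n = {n_used})")
```

If the design is sound, the estimates should stay roughly stable as the window narrows, even though the number of observations behind each one shrinks.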

Lee’s (2008) study of the effect of party incumbency on 
reelection probabilities illustrates the sharp RD design. Lee 
is interested in whether the Democratic candidate for a seat 
in the U.S. House of Representatives has an advantage if his 
party won the seat last time. The widely noted success of House 
incumbents raises the question of whether representatives use 
the privileges and resources of their office to gain advantage for 
themselves or their parties. This conjecture sounds plausible, 
but the success of incumbents need not reflect a real electoral 
advantage. Incumbents—by definition, candidates and par- 
ties who have shown they can win—may simply be better at 
satisfying voters or getting the vote out. 

To capture the causal effect of incumbency, Lee looks at
the likelihood a Democratic candidate wins as a function of
relative vote shares in the previous election. Specifically, he
exploits the fact that an election winner is determined by
$D_i = 1(x_i > 0)$, where $x_i$ is the vote share margin of victory
(e.g., the difference between the Democratic and Republican vote shares
when these are the two largest parties). Note that, because $D_i$
is a deterministic function of $x_i$, there are no confounding
variables other than $x_i$. This is a signal feature of the RD
setup.

² Hoxby (2000) also uses this idea to check RD estimates of class size effects.
A fully nonparametric approach requires data-driven rules for selection of the
width of the discontinuity-sample window, also known as “bandwidth”. The
bandwidth must shrink with the sample size at a rate sufficiently slow so as to
ensure consistent estimation of the underlying conditional mean functions. See
Imbens and Lemieux (2008) for details. We prefer to think of estimation using
(6.1.4) or (6.1.6) as essentially parametric: in any given sample, the estimates
are only as good as the model that you happen to be using. Promises about
how you might change the model if you had more data should be irrelevant.

Figure 6.1.2 The probability of winning an election by past and
future vote share (from Lee, 2008). (A) Candidate’s probability of
winning election t+1, by margin of victory in election t: local
averages and logit polynomial fit. (B) Candidate’s accumulated
number of past election victories, by margin of victory in election t:
local averages and logit polynomial fit.

Figure 6.1.2A, from Lee (2008), shows the sharp RD design
in action. This figure plots the probability that a Democrat
wins against the difference between Democratic and Republican
vote shares in the previous election. The dots in the figure
are local averages (the average win rate in nonoverlapping
windows of share margins that are .005 wide); the lines in
the figure are fitted values from a parametric model with a
discontinuity at zero.³ The probability of a Democratic win
is an increasing function of past vote share. The most important
feature of the plot, however, is the dramatic jump in win
rates at the 0 percent mark, the point where a Democratic
candidate gets more votes. Based on the size of the jump,
incumbency appears to raise party reelection probabilities by
about 40 percentage points.

Figure 6.1.2B checks the sharp RD identification assump- 
tions by looking at Democratic victories before the last 
election. Democratic win rates in older elections should be 
unrelated to the margin-of-victory cutoff in the last election, 
a specification check that works out well and increases our 
confidence in the RD design in this case. Lee’s investigation of 
pretreatment victories is a version of the idea that covariates 
should be balanced by treatment status as if in a randomized 
trial. A related check examines the density of $x_i$ around the
discontinuity, looking for bunching in the distribution of $x_i$
near $x_0$. The concern here is that individuals with a stake in
$D_i$ might try to manipulate $x_i$ near the cutoff, in which case
observations on either side may not be comparable (McCrary,
2008, proposes a formal test for this). Until recently, we would 
have said this is unlikely in election studies like Lee’s. But 
the recount in Florida after the 2000 presidential election sug- 
gests we probably should worry about manipulable vote shares 
when U.S. elections are close. 


6.2 Fuzzy RD Is IV 


Fuzzy RD exploits discontinuities in the probability or
expected value of treatment conditional on a covariate. The
result is a research design where the discontinuity becomes an
instrumental variable for treatment status instead of
deterministically switching treatment on or off. To see how this works,
let $D_i$ denote treatment status as before, though here $D_i$ is no
longer deterministically related to the threshold-crossing rule,
$x_i \ge x_0$. Rather, there is a jump in the probability of treatment
at $x_0$, so that

$$P(D_i = 1 \mid x_i) = \begin{cases} g_1(x_i) & \text{if } x_i \ge x_0 \\ g_0(x_i) & \text{if } x_i < x_0, \end{cases} \quad \text{where } g_1(x_0) \ne g_0(x_0).$$

The functions $g_0(x_i)$ and $g_1(x_i)$ can be anything as long as they
differ (and the more the better) at $x_0$. We’ll assume
$g_1(x_0) > g_0(x_0)$, so $x_i \ge x_0$ makes treatment more likely. We
can write the relation between the probability of treatment and $x_i$ as

$$E[D_i \mid x_i] = P(D_i = 1 \mid x_i) = g_0(x_i) + [g_1(x_i) - g_0(x_i)] T_i,$$

where

$$T_i = 1(x_i \ge x_0).$$

The dummy variable $T_i$ indicates the point where $E[D_i \mid x_i]$ is
discontinuous.

³ The fitted values in figure 6.1.2 are from a logit model for the probability
of winning as a function of the cutoff indicator $D_i = 1(x_i > 0)$, a 4th-order
polynomial in $x_i$, and interactions between the polynomial terms and $D_i$.

Fuzzy RD leads naturally to a simple 2SLS estimation strategy.
Assuming that $g_0(x_i)$ and $g_1(x_i)$ can be described by
pth-order polynomials as we did for $f_0(x_i)$ and $f_1(x_i)$, we have

$$\begin{aligned}
E[D_i \mid x_i] = {} & \gamma_{00} + \gamma_{01} x_i + \gamma_{02} x_i^2 + \cdots + \gamma_{0p} x_i^p \\
& + [\pi + \gamma_1^* x_i + \gamma_2^* x_i^2 + \cdots + \gamma_p^* x_i^p] T_i \\
= {} & \gamma_{00} + \gamma_{01} x_i + \gamma_{02} x_i^2 + \cdots + \gamma_{0p} x_i^p \\
& + \pi T_i + \gamma_1^* x_i T_i + \gamma_2^* x_i^2 T_i + \cdots + \gamma_p^* x_i^p T_i,
\end{aligned} \tag{6.2.1}$$

where the $\gamma^*$’s are coefficients on the polynomial interactions
with $T_i$.

From this we see that $T_i$, as well as the interaction terms
$\{x_i T_i, x_i^2 T_i, \ldots, x_i^p T_i\}$, can be used as instruments
for $D_i$ in (6.1.4).⁴

The simplest fuzzy RD estimator uses only $T_i$ as an instrument,
without the interaction terms (with the interaction
terms in the instrument list, we might also like to allow for
interactions in the second stage as in (6.1.6)). The resulting
just-identified IV estimator has the virtues of transparency and
good finite-sample properties. The first stage in this case is

$$D_i = \gamma_0 + \gamma_1 x_i + \gamma_2 x_i^2 + \cdots + \gamma_p x_i^p + \pi T_i + \xi_{1i}, \tag{6.2.2}$$

where $\pi$ is the first-stage effect of $T_i$.

The fuzzy RD reduced form is obtained by substituting
(6.2.2) into (6.1.4):

$$Y_i = \mu + \kappa_1 x_i + \kappa_2 x_i^2 + \cdots + \kappa_p x_i^p + \rho\pi T_i + \xi_{2i}, \tag{6.2.3}$$

where $\mu = \alpha + \rho\gamma_0$ and $\kappa_j = \beta_j + \rho\gamma_j$
for $j = 1, \ldots, p$. As with
sharp RD, identification in the fuzzy case turns on the ability
to distinguish the relation between $Y_i$ and the discontinuous
function, $T_i = 1(x_i \ge x_0)$, from the effect of polynomial
controls included in the first and second stage. In one of the first
RD studies in applied econometrics, van der Klaauw (2002)
used a fuzzy design to evaluate the effects of university financial
aid awards on college enrollment. In van der Klaauw’s
study, $D_i$ is the size of the financial aid award offer and $T_i$ is
a dummy variable indicating applicants with an ability index
above predetermined award threshold cutoffs. His fuzzy RD
estimates control for polynomial functions of this index.⁵

⁴ The idea of using jumps in the probability of assignment as a source of
identifying information appears to originate with Trochim (1984), although
the IV interpretation came later. Not everyone agrees that fuzzy RD is IV, but
this view is catching on. In a recent history of the RD idea, Cook (2008) writes
about the fuzzy design: “In many contexts, the cutoff value can function as an
IV and engender unbiased causal conclusions ... fuzzy assignment does not
seem as serious a problem today as earlier.”

⁵ Van der Klaauw’s original working paper circulated in 1997. Note that
the fact that the additive model, (6.2.2), is only an approximation of
$E[D_i \mid x_i]$ is not very important; second-stage estimates are still
consistent.

Fuzzy RD estimates with treatment effects that change as
a function of $x_i$ can be constructed by 2SLS estimation of an
equation with treatment-covariate interactions. The second-stage
model with interaction terms is the same as (6.1.6),
while the first stage is similar to (6.2.1), except that to match
the second-stage parametrization, we center polynomial terms
at $x_0$. In this case, the excluded instruments are
$\{T_i, \tilde{x}_i T_i, \tilde{x}_i^2 T_i, \ldots, \tilde{x}_i^p T_i\}$
while the variables
$\{D_i, \tilde{x}_i D_i, D_i \tilde{x}_i^2, \ldots, D_i \tilde{x}_i^p\}$ are
treated as endogenous. The first stage for $D_i$ becomes

$$\begin{aligned}
D_i = {} & \gamma_{00} + \gamma_{01}\tilde{x}_i + \gamma_{02}\tilde{x}_i^2 + \cdots + \gamma_{0p}\tilde{x}_i^p \\
& + \pi T_i + \gamma_1^* \tilde{x}_i T_i + \gamma_2^* \tilde{x}_i^2 T_i + \cdots + \gamma_p^* \tilde{x}_i^p T_i + \xi_{1i}.
\end{aligned} \tag{6.2.4}$$

An analogous first stage must be constructed for each of the
polynomial interaction terms in the set
$\{\tilde{x}_i D_i, D_i \tilde{x}_i^2, \ldots, D_i \tilde{x}_i^p\}$.
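
The simplest just-identified estimator built from (6.2.2) and (6.2.3) can be sketched in a few lines. In the simulation below, the selection mechanism (an unobserved uniform variable `u` that drives both treatment take-up and outcomes), the linear control in $x_i$, and all parameter values are assumptions made so that OLS is visibly biased while $T_i$ remains a valid instrument.

```python
import numpy as np


def ols(X, y):
    """OLS coefficient vector."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


rng = np.random.default_rng(2)
n, rho = 200_000, 1.0
x = rng.uniform(-1, 1, size=n)             # running variable, cutoff x0 = 0
t = (x >= 0).astype(float)                 # T_i = 1(x_i >= x0)
u = rng.uniform(size=n)                    # unobserved taste for treatment
p_treat = 0.3 + 0.1 * x + 0.4 * t          # P(D=1|x) jumps by pi = 0.4 at x0
d = (u < p_treat).astype(float)
eta = -(u - 0.5) + rng.normal(scale=0.5, size=n)   # error related to D through u
y = 1.0 + 0.5 * x + rho * d + eta

controls = np.column_stack([np.ones(n), x])        # first-order polynomial (p = 1)
W = np.column_stack([controls, t])
first_stage = ols(W, d)[-1]                        # estimate of pi in (6.2.2)
reduced_form = ols(W, y)[-1]                       # estimate of rho*pi in (6.2.3)

X = np.column_stack([controls, d])                 # second-stage regressors
Z = W                                              # instruments: controls and T_i
rho_2sls = np.linalg.solve(Z.T @ X, Z.T @ y)[-1]
rho_ols = ols(X, y)[-1]

print(f"first stage: {first_stage:.3f}   reduced form: {reduced_form:.3f}")
print(f"2SLS: {rho_2sls:.3f}   OLS: {rho_ols:.3f}")
```

Because the model is just-identified, the 2SLS estimate equals the ratio of the reduced-form coefficient on $T_i$ to the first-stage coefficient. Under the assumed selection mechanism the OLS estimate should be noticeably too large, while 2SLS should be close to the assumed $\rho$ of 1.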
The nonparametric version of fuzzy RD consists of IV estimation
in a small neighborhood around the discontinuity. The
reduced-form conditional expectation of $Y_i$ near $x_0$ is

$$E[Y_i \mid x_0 \le x_i < x_0 + \Delta] - E[Y_i \mid x_0 - \Delta < x_i < x_0] \simeq \rho\pi.$$

Similarly, for the first stage for $D_i$, we have

$$E[D_i \mid x_0 \le x_i < x_0 + \Delta] - E[D_i \mid x_0 - \Delta < x_i < x_0] \simeq \pi.$$

Therefore

$$\lim_{\Delta \to 0} \frac{E[Y_i \mid x_0 \le x_i < x_0 + \Delta] - E[Y_i \mid x_0 - \Delta < x_i < x_0]}{E[D_i \mid x_0 \le x_i < x_0 + \Delta] - E[D_i \mid x_0 - \Delta < x_i < x_0]} = \rho. \tag{6.2.5}$$


The sample analog of (6.2.5) is a Wald estimator of the sort
discussed in section 4.1.2, in this case using $T_i$ as an instrument
for $D_i$ in a $\Delta$-neighborhood of $x_0$.⁶ As with other dummy
variable instruments, the result is a local average treatment effect.
In particular, the Wald estimand for fuzzy RD captures the
causal effect on compliers, defined as individuals whose treatment
status changes as we move the value of $x_i$ from just to
the left of $x_0$ to just to the right of $x_0$. This interpretation of
fuzzy RD was introduced by Hahn, Todd, and van der Klaauw
(2001). However, there is another sense in which this version
of LATE is local: the estimates are for those with $x_i$ near $x_0$, a
feature of sharp nonparametric RD estimates as well.

⁶ To allow for changes in slope on either side of the cutoff, Imbens and
Lemieux (2008) suggest (6.2.5) be computed by 2SLS using $T_i$ as an
instrument for $D_i$ in a small neighborhood of the cutoff, with the interaction
terms $\{\tilde{x}_i T_i, \tilde{x}_i^2 T_i, \ldots, \tilde{x}_i^p T_i\}$
included as exogenous controls.

Finally, as with the nonparametric version of sharp RD, 
the finite-sample behavior of the sample analog of (6.2.5) is 
not likely to be very good. Hahn, Todd, and van der Klaauw 
(2001) develop a nonparametric IV procedure using local lin- 
ear regression to estimate the top and bottom of the Wald 
estimator with less bias. This takes us back to a 2SLS model 
with linear or polynomial controls, but the model is fit in a 
discontinuity sample using a data-driven bandwidth. The idea 
of using discontinuity samples informally also applies in this 
context: start with a parametric 2SLS setup in the full sample, 
say, based on (6.1.4). Then restrict the sample to points near 
the discontinuity and get rid of most or all of the polynomial 
controls. Ideally, 2SLS estimates in the discontinuity samples 
with few controls will be broadly consistent with the more 
precise estimates constructed using the larger sample. 
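
The sample analog of (6.2.5), and its interpretation as an effect on compliers, can be illustrated with simulated data. In the sketch below the population shares of always-takers, compliers, and never-takers, their treatment effects, and the window width are all hypothetical values chosen for the example.

```python
import numpy as np


def wald_rd(y, d, x, x0, width):
    """Sample analog of (6.2.5): ratio of the jumps in mean Y and mean D near x0."""
    right = (x >= x0) & (x < x0 + width)
    left = (x > x0 - width) & (x < x0)
    jump_y = y[right].mean() - y[left].mean()
    jump_d = d[right].mean() - d[left].mean()
    return jump_y / jump_d


rng = np.random.default_rng(3)
n = 1_000_000
x = rng.uniform(-1, 1, size=n)
t = (x >= 0).astype(float)

# Hypothetical types: 0 = always-taker, 1 = complier, 2 = never-taker.
kind = rng.choice(3, size=n, p=[0.2, 0.4, 0.4])
d = np.where(kind == 0, 1.0, np.where(kind == 1, t, 0.0))
effect = np.where(kind == 0, 2.0, 1.0)     # compliers gain 1, always-takers gain 2
y = 0.5 * x + effect * d + rng.normal(scale=0.5, size=n)

estimate = wald_rd(y, d, x, x0=0.0, width=0.02)
print(f"Wald estimate near the cutoff: {estimate:.3f}")
```

The estimate should be in the neighborhood of 1, the effect assumed for compliers, rather than the larger average effect among everyone who is treated. It will also be noisy, since only about 2 percent of the simulated observations fall inside the window.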

Angrist and Lavy (1999) use a fuzzy RD research design 
to estimate the effects of class size on children’s test scores, 
the same question addressed by the STAR experiment dis- 
cussed in chapter 2. Fuzzy RD is an especially powerful and 
flexible research design, a fact highlighted by the Angrist and 
Lavy study, which generalizes fuzzy RD in two ways relative 
to the discussion above. First, the causal variable of interest, 
class size, takes on many values (as in the discussion of average
causal response in chapter 4). So the first stage exploits
jumps in average class size instead of probabilities. Second, 
the Angrist and Lavy (1999) research design uses multiple 
discontinuities. 

The Angrist and Lavy study begins with the observation that
class size in Israeli schools is capped at 40. Students in a grade
with up to 40 students can expect to be in classes as large as 40,
but grades with 41 students are split into two classes, grades
with 81 students are split into three classes, and so on. Angrist
and Lavy call this “Maimonides’ rule,” since a maximum class
size of 40 was first proposed by the medieval Talmudic scholar
Maimonides. To formalize Maimonides’ rule, let $m_{sc}$ denote
the predicted class size (in a given grade) assigned to class
$c$ in school $s$, where enrollment in the grade is denoted $e_s$.
Assuming grade cohorts are split up into classes of equal size,
the predicted class size that results from a strict application of
Maimonides’ rule is

$$m_{sc} = \frac{e_s}{\mathrm{int}\left[\frac{e_s - 1}{40}\right] + 1},$$

where $\mathrm{int}(a)$ is the integer part of a real number, $a$. This
function, plotted with dotted lines in figure 6.2.1 for fourth and
fifth graders, has a sawtooth pattern with discontinuities (in
this case, sharp drops in predicted class size) at integer
multiples of 40. At the same time, $m_{sc}$ is clearly an increasing
function of enrollment, $e_s$, making the enrollment variable an
important control.
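
Maimonides’ rule is simple enough to write down as a function. The sketch below is a direct transcription of the formula above; the function name and the `cap` argument are illustrative choices, and enrollment is assumed to be a positive integer.

```python
def maimonides_class_size(enrollment: int, cap: int = 40) -> float:
    """Predicted class size m_sc for a grade with the given enrollment e_s."""
    n_classes = (enrollment - 1) // cap + 1    # int[(e_s - 1) / 40] + 1
    return enrollment / n_classes


for e in (40, 41, 80, 81, 120, 121):
    print(e, maimonides_class_size(e))
```

Predicted class size is 40 at an enrollment of 40, drops to 20.5 at 41, climbs back to 40 at 80, and drops to 27 at 81, which is the sawtooth pattern described in the text.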

Angrist and Lavy exploit the discontinuities in Maimonides’ 
rule by constructing 2SLS estimates of an equation like 


$$Y_{isc} = \alpha_0 + \alpha_1 d_s + \beta_1 e_s + \beta_2 e_s^2 + \cdots + \beta_p e_s^p + \rho n_{sc} + \eta_{isc}, \tag{6.2.6}$$


where Y;sc is student 7’s test score in school s and class c, 
nsc is the size of this class, and e, is enrollment. In this ver- 
sion of fuzzy RD, msc plays the role of T;,es plays the role 
of x;, and class size, nsc, plays the role of p,;. Angrist and 
Lavy also include a nonenrollment covariate, ds, to control 
for the proportion of students in the school from a disadvan- 
taged background. This is not necessary for RD, since the only 
source of omitted variables bias in the RD model is eṣ, but 
it makes the specification comparable to the model used to 
construct a corresponding set of OLS estimates.” 

Figure 6.2.1 plots the average of actual and predicted class 
sizes against enrollment in fourth and fifth grade. Maimonides’ 
rule does not predict class size perfectly, mostly because some 
schools split grades at enrollments lower than 40. This is what 


7The Angrist and Lavy (1999) study differs modestly from the description 
here in that the data used to estimate equation (6.2.6) are class averages. But 
since the covariates are all defined at the class or school level, the only differ- 
ence between student-level and class-level estimation is the implicit weighting 
by number of students in the student-level estimates.

Figure 6.2.1 The fuzzy-RD first-stage for regression-discontinuity
estimates of the effect of class size on test scores (from Angrist and 
Lavy, 1999). 


makes the RD design fuzzy. Still, there are clear drops in class 
size at enrollment levels of 40, 80, and 120. Note also that 
the $m_{sc}$ instrument implicitly combines both discontinuities
and slope-discontinuity interactions such as $x_i T_i$ in (6.2.4) in a
single variable ($m_{sc}$ becomes a shallower function of $e_s$ above
each kink). This compact parametrization comes from a spe- 
cific understanding of the institutions and rules that determine 
Israeli class size. 

Estimates of equation (6.2.6) for fifth-grade math scores are 
reported in table 6.2.1, beginning with OLS. With no controls, 
there is a strong positive relationship between class size and

TABLE 6.2.1
OLS and fuzzy RD estimates of the effect of class size on
fifth-grade math scores

                                OLS                                2SLS
                                                   Full Sample      Discontinuity Samples
                                                                         ±5            ±3
                          (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)
Mean score                        67.3                67.3              67.0            67.0
(SD)                             (9.6)                (9.6)            (10.2)         (10.6)
Regressors
Class size               .322     .076     .019    -.230    -.261    -.185    -.443    -.270
                       (.039)   (.036)   (.044)   (.092)   (.113)   (.151)   (.236)   (.281)
Percent                          -.340    -.332    -.350    -.350    -.459    -.435
disadvantaged                   (.018)   (.018)   (.019)   (.019)   (.049)   (.049)
Enrollment                                 .017     .041     .062              .079
                                         (.009)   (.012)   (.037)            (.036)
Enrollment                                                  -.010
squared/100                                                (.016)
Segment 1                                                                              -12.6
(enrollment 38-43)                                                                    (3.80)
Segment 2                                                                              -2.89
(enrollment 78-83)                                                                    (2.41)
R²                       .048     .249     .252
Number of classes                2,018                2,018              471             302


Notes: Adapted from Angrist and Lavy (1999). The table reports estimates of equation 
(6.2.6) in the text using class averages. Standard errors, reported in parentheses, are corrected 
for within-school correlation. 


test scores. Most of this vanishes, however, when the percent
disadvantaged in the school is included as a control. The 
positive correlation between class size and test scores shrinks 
to insignificance when enrollment is added as an additional 
control, as can be seen in column 3. Still, there is no evidence 
that smaller classes are better, as the results from the Tennessee 
STAR randomized trial would lead us to expect. 

In contrast to the OLS estimates in column 3, 2SLS estimates
of similar specifications using $m_{sc}$ as an instrument for $n_{sc}$
strongly suggest that smaller classes increase test scores. These
results, reported in column 4 for models that include a linear
enrollment control and in column 5 for models that include a
quadratic enrollment control, range from -.23 to -.26, with
standard errors around .1. These results suggest a seven-student
reduction in class size (as in Tennessee STAR) raises math
scores by about 1.75 points, for an effect size of .18$\sigma$, where
$\sigma$ is the standard deviation of class average scores. This is not
too far from the Tennessee estimates. 
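
The arithmetic behind this effect size is easy to reproduce. As a rough sketch, take .25 as a round value for the magnitude of the 2SLS coefficient (our reading of the estimates of -.23 to -.26, not a number reported separately) and 9.6 as the standard deviation of class average scores from table 6.2.1:

```latex
\[
7 \times 0.25 = 1.75 \text{ points}, \qquad \frac{1.75}{9.6} \approx 0.18 .
\]
```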

Importantly, the functional form of the enrollment control 
does not seem to matter very much (though estimates with no 
controls, not reported in the table, come out much smaller 
and insignificant). Columns 6 and 7 check the robustness of
the main findings further using a ±5 discontinuity sample.
Not surprisingly, these results are much less precise than those
reported in columns 4 and 5 since they were estimated with
only about one-quarter of the data used to construct the
full-sample estimates. Still, they bounce around the -.25 mark.
Finally, the last column shows the results of estimation using 
an even narrower discontinuity sample limited to schools with 
an enrollment of plus or minus three students around the dis- 
continuities at 40, 80, and 120 (with dummy controls for 
which of these discontinuities is relevant). These are Wald 
estimates in the spirit of Hahn, Todd, and van der Klaauw 
(2001) and formula (6.2.5); the instrument used to construct 
these estimates is a dummy for being in a school with enrollment
just to the right of the relevant discontinuity. The result
is an imprecise -.270, but still strikingly similar to the other
estimates in the table. This set of estimates illustrates the price 
to be paid in terms of precision when we shrink the sample 
around the discontinuities. Happily, however, the picture that 
emerges from table 6.2.1 is fairly clear.


Here’s a prayer for you. Got a pencil? . . . “Protect me from
knowing what I don’t need to know. Protect me from even 
knowing that there are things to know that I don’t know. 
Protect me from knowing that I decided not to know about 
the things I decided not to know about. Amen.” There’s 
another prayer that goes with it. “Lord, lord, lord. Protect 
me from the consequences of the above prayer.” 

Douglas Adams, Mostly Harmless 


Rightly or wrongly, 95 percent of applied econometrics is
concerned with averages. If, for example, a training program
raises average earnings enough to offset the costs,
we are happy. The focus on averages is partly because it’s hard 
enough to produce good estimates of average causal effects. 
And if the dependent variable is a dummy for something like 
employment, the mean describes the entire distribution. But 
many variables, such as earnings and test scores, have contin- 
uous distributions. These distributions can change in ways not 
revealed by an examination of averages; for example, they can 
spread out or become more compressed. Applied economists 
increasingly want to know what is happening to an entire 
distribution, to the relative winners and losers, as well as to 
averages. 

Policy makers and labor economists have been especially 
concerned with changes in the wage distribution. We know, 
for example, that flat average real wages are only a small part 
of what’s been going on in the labor market for the past 25 
years. Upper earnings quantiles have been increasing, while 
lower quantiles have been falling. In other words, the rich 
are getting richer and the poor are getting poorer. Recently,
inequality has grown asymmetrically; for example, among college
graduates, it’s mostly the rich getting richer, with wages at
the lower decile unchanging. The complete story of the chang- 
ing wage distribution is fairly complicated and would seem to 
be hard to summarize. 

Quantile regression is a powerful tool that makes the task of 
modeling distributions easy, even when the underlying story 
is complex and multidimensional. We can use this tool to 
see whether participation in a training program or member- 
ship in a labor union affects earnings inequality as well as 
average earnings. We can also check for interactions, such as 
whether and how the relation between schooling and inequal- 
ity has been changing over time. Quantile regression works 
very much like conventional regression: confounding factors 
can be held fixed by including covariates; interaction terms 
work similarly too. And sometimes we can even use instrumen- 
tal variables methods to estimate causal effects on quantiles 
when a selection-on-observables story seems implausible. 


7.1 The Quantile Regression Model 


The starting point for quantile regression is the conditional
quantile function (CQF). Suppose we are interested in the
distribution of a continuously distributed random variable, $Y_i$,
with a well-behaved density (no gaps or spikes). Then the CQF
at quantile $\tau$ given a vector of regressors, $X_i$, can be defined
as:

$$Q_\tau(Y_i \mid X_i) = F_y^{-1}(\tau \mid X_i),$$

where $F_y(y \mid X_i)$ is the distribution function for $Y_i$ at $y$,
conditional on $X_i$. When $\tau = .10$, for example, $Q_\tau(Y_i \mid X_i)$
describes the lower decile of $Y_i$ given $X_i$, while $\tau = .5$ gives us
the conditional median.¹ By looking at the CQF of earnings as a
function of education, we can tell whether the dispersion in
earnings goes up or down with schooling. The CQF of earnings
as a function of education and time tells us whether the
relationship between schooling and inequality is changing over
time.


1More generally, we can define the CQF for discrete random variables and
random variables with less than well-behaved densities as

$$Q_\tau(Y_i \mid X_i) = \inf\{y : F_y(y \mid X_i) \ge \tau\}.$$

The CQF is the conditional quantile version of the conditional
expectation function (CEF). Recall that the CEF can
be derived as the solution to a mean-squared error prediction
problem,

$$E[Y_i \mid X_i] = \arg\min_{m(X_i)} E\left[(Y_i - m(X_i))^2\right].$$

In the same spirit, the CQF solves the following minimization
problem,

$$Q_\tau(Y_i \mid X_i) = \arg\min_{q(X)} E\left[\rho_\tau(Y_i - q(X_i))\right], \tag{7.1.1}$$

where $\rho_\tau(u) = (\tau - 1(u \le 0))u$ is called the “check function”
because it looks like a check-mark when you plot it. If $\tau = .5$,
this becomes least absolute deviations because
$\rho_{.5}(u) = \frac{1}{2}(\operatorname{sign} u)u = \frac{1}{2}|u|$.
In this case, $Q_\tau(Y_i \mid X_i)$ is the conditional
median since the conditional median minimizes absolute deviations.
Otherwise, the check function weights positive and
negative terms asymmetrically:

$$\rho_\tau(u) = 1(u > 0) \cdot \tau|u| + 1(u \le 0) \cdot (1 - \tau)|u|.$$


This asymmetric weighting generates a minimand that picks 
out conditional quantiles (a fact that is not immediately obvi- 
ous but can be proved with a little work; see Koenker, 2005). 
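
A small numerical illustration may make this less mysterious. The sketch below is not taken from Koenker (2005); it uses made-up, right-skewed data and a brute-force search over the observed values to show that minimizing the total check loss picks out sample quantiles. The helper names are ours.

```python
def check_loss(u: float, tau: float) -> float:
    """Check function rho_tau(u) = (tau - 1(u <= 0)) * u."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u


def argmin_check(ys, tau):
    """Observed value q that minimizes the total check loss sum_i rho_tau(y_i - q)."""
    return min(sorted(ys), key=lambda q: sum(check_loss(y - q, tau) for y in ys))


# Hypothetical skewed sample: 0, 1, 4, 9, ..., 10000 (101 points).
ys = [k * k for k in range(101)]

q10 = argmin_check(ys, 0.10)   # lower decile
q50 = argmin_check(ys, 0.50)   # median
q90 = argmin_check(ys, 0.90)   # upper decile
mean = sum(ys) / len(ys)       # the mean minimizes squared error instead
print(q10, q50, q90, mean)
```

The minimizers are the 11th, 51st, and 91st order statistics (100, 2500, and 8100), while the sample mean, 3350, sits well above the median because the data are skewed.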

With continuous or high-dimensional $X_i$, the CQF shares
the disadvantages of the CEF: it may be hard to estimate and
summarize. We’d therefore like to boil this function down to
a small set of numbers, one for each element of $X_i$. Quantile
regression accomplishes this by substituting a linear model for
$q(X_i)$ in (7.1.1), producing

$$\beta_\tau = \arg\min_{b} E\left[\rho_\tau(Y_i - X_i'b)\right]. \tag{7.1.2}$$

The quantile regression estimator, $\hat{\beta}_\tau$, is the sample analog
of (7.1.2). It turns out this minimization is a linear
programming problem that is fairly easy (for computers) to
solve.

Just as OLS fits a linear model to $Y_i$ by minimizing expected
squared error, quantile regression fits a linear model to $Y_i$ using
the asymmetric loss function, $\rho_\tau(u)$. If $Q_\tau(Y_i \mid X_i)$ is in fact
linear, the quantile regression minimand will find it (just as if the
CEF is linear, OLS will find it). The original quantile regression
model, introduced by Koenker and Bassett (1978), was
motivated by the assumption that the CQF is linear. As it turns 
out, however, the assumption of a linear CQF is unnecessary: 
quantile regression is useful whether or not we believe this. 

Before turning to a more general theoretical discussion of 
quantile regression, we illustrate the use of this tool to study 
the wage distribution. The motivation for the use of quan- 
tile regression to look at the wage distribution comes from 
labor economists’ interest in the question of how inequality 
varies conditional on covariates like education and experience 
(see, e.g., Buchinsky, 1994). The overall gap in earnings by 
schooling group (e.g., the college wage premium) grew consid- 
erably in the 1980s and 1990s. Less clear, however, is how the 
wage distribution was changing within education and experi- 
ence groups. Many labor economists believe that increases in 
within-group inequality provide especially strong evidence of 
fundamental changes in the labor market, not easily accounted 
for by changes in institutional features such as the percentage 
of workers who belong to labor unions. 

Table 7.1.1 reports schooling coefficients from quantile 
regressions estimated using the 1980, 1990, and 2000 cen- 
suses. The models used to construct these estimates control 
for race and a quadratic function of potential labor market 
experience (defined as age — education — 6). The .5 quantile 
coefficients—for the conditional median—are close to the OLS 
coefficients in the far right columns. For example, the OLS 
estimate of .072 in the 1980 census is not very different from 
the .5 quantile coefficient of about .068 in the same data. If 
the conditional-on-covariates distribution of log wages is sym- 
metric, so that the conditional median equals the conditional 
mean, we should expect these two coefficients to be the same. 
Also noteworthy is the fact that the quantile coefficients are
similar across quantiles in 1980. An additional year of schooling
raises median wages by 6.8 percent, with slightly higher
effects on the lower and upper quartiles of the conditional
wage distribution equal to .074 and .070. Although the estimated
returns to schooling increased sharply between 1980
and 1990 (up to .106 at the median, with an OLS return
of .114), there is still a reasonably stable pattern of
returns across quantiles in the 1990 census. The largest effect
is on the upper decile, a coefficient of .137, while the other
quantile coefficients are around .11.

TABLE 7.1.1
Quantile regression coefficients for schooling in the 1980,
1990, and 2000 censuses

            Desc. Stats.          Quantile Regression Estimates         OLS Estimates
Census    Obs.   Mean   SD     0.1     0.25    0.5     0.75    0.9     Coeff.  Root MSE
1980    65,023   6.4   .67    .074    .074    .068    .070    .079     .072      .63
                             (.002)  (.001)  (.001)  (.001)  (.001)   (.001)
1990    86,785   6.5   .69    .112    .110    .106    .111    .137     .114      .64
                             (.003)  (.001)  (.001)  (.001)  (.003)   (.001)
2000    97,397   6.5   .75    .092    .105    .111    .120    .157     .114      .69
                             (.002)  (.001)  (.001)  (.001)  (.004)   (.001)

Notes: Adapted from Angrist, Chernozhukov, and Fernandez-Val (2006). The table
reports quantile regression estimates of the returns to schooling in a model for log wages,
with OLS estimates shown at the right for comparison. The sample includes U.S.-born white
and black men aged 40-49. The sample size and the mean and standard deviation of log
wages in each census extract are shown at the left. Standard errors are reported in
parentheses. All models control for race and potential experience. Sampling weights were used
for the 2000 census estimates.

We should expect to see constant coefficients across quan- 
tiles if the effect of schooling on wages amounts to what is 
sometimes called a location shift. Here, this means that as 
higher schooling levels raise average earnings, other parts 
of the wage distribution move in tandem (i.e., within-group 
inequality does not change). Suppose, for example, that 
log wages can be described by a classical linear regression 
model: 


$$Y_i \sim N(X_i'\beta, \sigma_\varepsilon^2), \tag{7.1.3}$$

where $E[Y_i \mid X_i] = X_i'\beta$ and $Y_i - X_i'\beta = \varepsilon_i$ is a normally
distributed error with constant variance $\sigma_\varepsilon^2$. Homoskedasticity
means the conditional distribution of log wages is no more
spread out for college graduates than for high school graduates.
The implications of the linear homoskedastic model for
quantiles are apparent from the fact that

$$P[Y_i - X_i'\beta < \sigma_\varepsilon \Phi^{-1}(\tau) \mid X_i] = \tau,$$

where $\Phi^{-1}(\tau)$ is the inverse of the standard normal CDF. From
this we conclude that $Q_\tau(Y_i \mid X_i) = X_i'\beta + \sigma_\varepsilon \Phi^{-1}(\tau)$. In other
words, apart from the changing intercept, $\sigma_\varepsilon \Phi^{-1}(\tau)$, quantile
regression coefficients are the same at each quantile. The
results in table 7.1.1 for 1980 and 1990 are not too far from
this stylized representation.

In contrast to the simple pattern in 1980 and 1990 census 
data, quantile regression estimates from the 2000 census dif- 
fer markedly across quantiles, especially in the right tail. An 
additional year of schooling raises the lower decile of wages 
by 9.2 percent, the median by 11.1 percent, and the upper 
decile by 15.7 percent. Thus, in addition to increases in overall 
inequality in the 1980s and 1990s (a fact we know from sim- 
ple descriptive statistics), by 2000, inequality began to increase 
with education as well (since a pattern of increasing schooling
coefficients across quantiles means the wage distribution
spreads out as education increases). This development is the 
subject of considerable discussion among labor economists, 
who are particularly concerned with whether it points to fun- 
damental or institutional changes in the labor market (see, e.g., 
Autor, Katz, and Kearney, 2005, and Lemieux, 2008). 

A parametric example helps us see the link between quantile 
regression coefficients and conditional variance. Specifically, 
we can generate increasing quantile regression coefficients 
by adding heteroskedasticity to the classic normal regression 
model, (7.1.3). Suppose that 


$$Y_i \sim N(X_i'\beta, \sigma^2(X_i)),$$

where $\sigma^2(X_i) = (\lambda'X_i)^2$ and $\lambda$ is a vector of positive coefficients
such that $\lambda'X_i > 0$ (perhaps proportional to $\beta$, so that the
conditional variance grows with the conditional mean).² Then

$$P[Y_i - X_i'\beta < (\lambda'X_i)\Phi^{-1}(\tau) \mid X_i] = \tau,$$

with the implication that

$$Q_\tau(Y_i \mid X_i) = X_i'\beta + (\lambda'X_i)\Phi^{-1}(\tau) = X_i'[\beta + \lambda\Phi^{-1}(\tau)], \tag{7.1.4}$$

so that quantile regression coefficients increase across quantiles
with $\beta_\tau = \beta + \lambda\Phi^{-1}(\tau)$.
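
The following sketch evaluates (7.1.4) numerically. The coefficient values are hypothetical, chosen only to mimic a log-wage equation with an intercept and years of schooling; they are not estimates from table 7.1.1, and the function name is ours.

```python
from statistics import NormalDist


def qr_coefficients(beta, lam, tau):
    """beta_tau = beta + lam * Phi^{-1}(tau), as implied by (7.1.4)."""
    z = NormalDist().inv_cdf(tau)
    return [b + l * z for b, l in zip(beta, lam)]


# Hypothetical parameters for X_i = (1, years of schooling).
beta = [5.0, 0.10]   # conditional mean coefficients
lam = [0.20, 0.02]   # lambda'X_i > 0 is the conditional standard deviation

slopes = {tau: qr_coefficients(beta, lam, tau)[1] for tau in (0.1, 0.5, 0.9)}

# Homoskedastic special case (7.1.3): only the intercept varies with tau.
flat = {tau: qr_coefficients(beta, [0.56, 0.0], tau)[1] for tau in (0.1, 0.5, 0.9)}

# Check P[Y_i < X_i'beta_tau | X_i] = tau at 16 years of schooling.
x = (1.0, 16.0)
mu = sum(xj * bj for xj, bj in zip(x, beta))
sd = sum(xj * lj for xj, lj in zip(x, lam))
q90 = sum(xj * bj for xj, bj in zip(x, qr_coefficients(beta, lam, 0.9)))
print(slopes, flat, NormalDist(mu, sd).cdf(q90))
```

With a positive schooling element in $\lambda$, the schooling coefficient rises with $\tau$; setting that element to zero reproduces the location-shift case, where the schooling coefficient is the same at every quantile.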

Putting the pieces together, table 7.1.1 neatly summa- 
rizes two stories, both related to variation in within-group 
inequality. First, results from the 2000 census show inequality 
increasing sharply with education. The increase is asymmet- 
ric, however, and appears much more clearly in the upper 
tail of the wage distribution. Second, this increase is a new 
development. In 1980 and 1990, schooling affected the wage 
distribution in a manner roughly consistent with a simple 
location shift. 


7.1.1 Censored Quantile Regression 


Quantile regression allows us to look at features of the conditional
distribution of $Y_i$ when part of the distribution is hidden.
Suppose you have data of the form

$$Y_{i,obs} = Y_i \cdot 1[Y_i < c] + c \cdot 1[Y_i \ge c], \tag{7.1.5}$$

where $Y_{i,obs}$ is what you get to see and $Y_i$ is the variable you
would like to see. The variable $Y_{i,obs}$ is censored: information
about $Y_i$ in $Y_{i,obs}$ is limited for confidentiality reasons or
because it was too difficult or time-consuming to collect more
information. In the CPS, for example, high earnings are topcoded
to protect respondent confidentiality. This means that
data above the topcode are recoded to have the topcode value.
Duration data may also be censored: in a study of the effects
of unemployment insurance on the duration of employment,
we might follow new claimants for up to 40 weeks. Anyone
out of work for longer has an unemployment spell length
that is censored at 40. Note that limited dependent variables
such as hours worked or medical expenditure, discussed in
section 3.4.2, are not censored; they take on the value zero
by their nature, just as dummy variables such as employment
status do.


2See Card and Lemieux (1996) for an empirical example of a regression
model with this sort of heteroskedasticity. Koenker and Portnoy (1996) call
this a linear location-scale model.

3The formula for asymptotic quantile regression standard errors assuming
a linear CQF is

$$\tau(1 - \tau)\, E[f_{u_\tau}(0 \mid X_i) X_i X_i']^{-1}\, E[X_i X_i']\, E[f_{u_\tau}(0 \mid X_i) X_i X_i']^{-1},$$

where $f_{u_\tau}(0 \mid X_i)$ is the conditional density of the quantile regression residual
at zero. If the residuals are homoskedastic this simplifies to
$\frac{\tau(1 - \tau)}{f_{u_\tau}^2(0)} E[X_i X_i']^{-1}$,
where $f_{u_\tau}^2(0)$ is the square of the unconditional residual density. Angrist,
Chernozhukov, and Fernandez-Val (2006) give a more general formula allowing
the CQF to be nonlinear.

When dealing with censored dependent variables, quantile 
regression can be used to estimate the effect of covariates 
on conditional quantiles that are below the censoring point 
(assuming censoring is from above). This reflects the fact that 
censoring earnings above, say, the median has no effect on the 
median. So if CPS topcoding affects relatively few people (as
is often true), censoring has no effect on estimates of the conditional
median or even $\beta_\tau$ for $\tau = .75$. Likewise, if less than
10 percent of the sample is censored conditional on all values
of $X_i$, then, when estimating $\beta_\tau$ for $\tau$ up to .9, you can simply
ignore censoring. Alternatively, you can limit the sample to
values of $X_i$ where $Q_\tau(Y_i \mid X_i)$ is below $c$ (or above, if censoring
is from the bottom with $Y_{i,obs} = Y_i \cdot 1[Y_i > c] + c \cdot 1[Y_i \le c]$).
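
The claim that topcoding leaves lower quantiles untouched is easy to verify on made-up data. In this sketch the earnings values and the topcode are hypothetical, and `sample_quantile` (our name) implements the inf definition of a quantile.

```python
import math


def sample_quantile(ys, tau):
    """Smallest observed y whose empirical CDF is at least tau."""
    s = sorted(ys)
    k = max(math.ceil(tau * len(s)), 1)
    return s[k - 1]


# Hypothetical earnings: 10, 20, ..., 1000, topcoded at c = 850 (15% censored).
earnings = [10 * k for k in range(1, 101)]
c = 850
observed = [min(y, c) for y in earnings]

for tau in (0.5, 0.75, 0.9):
    print(tau, sample_quantile(earnings, tau), sample_quantile(observed, tau))
print(sum(earnings) / 100, sum(observed) / 100)
```

With 15 percent of the sample topcoded, the median and the upper quartile are the same in the latent and observed data, the upper decile is pulled down to the topcode, and the mean falls.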

Powell (1986) formalizes this idea with the censored quan- 
tile regression estimator. Because we may not know which con- 
ditional quantiles are below the censoring point (continuing to 
think of top codes for example), Powell proposes we work with 


$$Q_\tau(Y_i \mid X_i) = \min(c, X_i'\beta_\tau^c).$$

The parameter vector $\beta_\tau^c$ solves

$$\beta_\tau^c = \arg\min_{b} E\left\{1[X_i'b < c] \cdot \rho_\tau(Y_i - X_i'b)\right\}. \tag{7.1.6}$$

In other words, we solve the quantile regression minimization
problem for values of $X_i$ such that $X_i'\beta_\tau^c < c$. (In practice,
we minimize the sample analog of (7.1.6).) As long as there
are enough uncensored data, the resulting estimates give us
the quantile regression function we would have gotten had 
the data not been censored (assuming the conditional quantile 
function is, in fact, linear). And if it turns out that the condi- 
tional quantiles you are estimating are all below the censoring 
point, then you are back to regular quantile regression. 

The sample analog of (7.1.6) is no longer a linear pro- 
gramming problem, but Buchinsky (1994) proposes a simple 
iterated linear programming algorithm that seems to work. 
The iterations go like this. First estimate $\beta_\tau^c$ ignoring the censoring.
Then find the cells with $X_i'\beta_\tau^c < c$. Then estimate the
quantile regression again using these cells only, and so on. This
algorithm is not guaranteed to converge, but it appears to do
so in practice. Standard errors can be bootstrapped. Buchinsky
(1994) used this approach to estimate the returns to schooling
for highly experienced workers who may have earnings above
the CPS top code. The censoring adjustment tends to increase
the returns to schooling for this group.⁴
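
The iteration can be sketched in a few lines. This is not Buchinsky's code: a brute-force search over lines through pairs of data points stands in for the linear programming step (workable only for tiny samples), the data are made up, and the small tolerance `tol` is our own guard against rounding when a fitted value sits exactly at the censoring point.

```python
from itertools import combinations


def check_loss(u, tau):
    """Check function rho_tau(u) = (tau - 1(u <= 0)) * u."""
    return (tau - (1.0 if u <= 0 else 0.0)) * u


def quantile_regression(xs, ys, tau):
    """Fit y = a + b*x by brute force (a stand-in for linear programming).

    Some optimal line passes through two data points, so for small samples
    we can simply try every pair of points with distinct x.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(sorted(set(zip(xs, ys))), 2):
        if x1 == x2:
            continue
        b = (y2 - y1) / (x2 - x1)
        a = y1 - b * x1
        loss = sum(check_loss(y - a - b * x, tau) for x, y in zip(xs, ys))
        if best is None or loss < best[0]:
            best = (loss, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


def censored_qr(xs, ys_obs, c, tau, max_iter=20, tol=1e-9):
    """Iterate: fit, keep cells with fitted quantile below c, refit."""
    keep = list(range(len(xs)))  # first pass ignores the censoring
    for _ in range(max_iter):
        a, b = quantile_regression([xs[i] for i in keep],
                                   [ys_obs[i] for i in keep], tau)
        # tol guards against rounding when a fitted value sits exactly at c
        new_keep = [i for i, x in enumerate(xs) if a + b * x < c - tol]
        if new_keep == keep:
            break  # the set of retained cells has stopped changing
        keep = new_keep
    return a, b


# Hypothetical data: four schooling cells, latent median of y equal to x.
xs = [x for x in (8, 12, 16, 20) for _ in range(5)]
ys = [float(x + e) for x in (8, 12, 16, 20) for e in (-2, -1, 0, 1, 2)]
c = 18.0
ys_obs = [min(y, c) for y in ys]  # topcoding at c

naive = quantile_regression(xs, ys_obs, 0.5)  # ignores censoring
fixed = censored_qr(xs, ys_obs, c, 0.5)
print(naive, fixed)
```

In this example, ignoring the censoring gives a median regression slope of .75; after dropping the cell whose fitted median reaches the topcode, the iteration settles on the latent slope of 1 in two passes.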


7.1.2 The Quantile Regression 
Approximation Property* 


The CQF of log wages given schooling is unlikely to be exactly 
linear, so the assumptions of the original quantile regression 
model fail to hold in this example. Luckily, quantile regression 
can be understood as giving a MMSE linear approximation 
to the CQF, though in this case the approximation is a little 
more complicated and harder to derive than the regression-CEF
theorem. For any quantile index $\tau \in (0,1)$, define the
quantile regression specification error as:

$$\Delta_\tau(X_i, \beta_\tau) \equiv X_i'\beta_\tau - Q_\tau(Y_i \mid X_i).$$


4See Buchinsky and Hahn (1998) and Chernozhukov and Hong (2002) for
more sophisticated estimators with better theoretical properties.

The population quantile regression vector can be shown to
minimize an expected weighted average of the squared specification
error, $\Delta_\tau^2(X_i, \beta_\tau)$, as described in the following theorem
from Angrist, Chernozhukov, and Fernandez-Val (2006):


Theorem 7.1.1 Quantile Regression Approximation.
Suppose that (i) the conditional density $f_y(y \mid X_i)$ exists almost
surely, (ii) $E[Y_i]$, $E[Q_\tau(Y_i \mid X_i)]$, and $E\|X_i\|$ are finite, and
(iii) $\beta_\tau$ uniquely solves (7.1.2). Then

$$\beta_\tau = \arg\min_{b} E\left[w_\tau(X_i, b) \cdot \Delta_\tau^2(X_i, b)\right], \tag{7.1.7}$$

where

$$w_\tau(X_i, b) = \int_0^1 (1 - u) \cdot f_{\varepsilon(\tau)}(u \cdot \Delta_\tau(X_i, b) \mid X_i)\, du$$

$$= \int_0^1 (1 - u) \cdot f_y(u \cdot X_i'b + (1 - u) \cdot Q_\tau(Y_i \mid X_i) \mid X_i)\, du$$

and $\varepsilon_i(\tau)$ is a quantile-specific residual,

$$\varepsilon_i(\tau) \equiv Y_i - Q_\tau(Y_i \mid X_i),$$

with conditional density $f_{\varepsilon(\tau)}(e \mid X_i)$ at $\varepsilon_i(\tau) = e$. Moreover,
when $Y_i$ has a smooth conditional density, we have for $\beta$ in
the neighborhood of $\beta_\tau$:

$$w_\tau(X_i, \beta) \approx 1/2 \cdot f_y(Q_\tau(Y_i \mid X_i) \mid X_i). \tag{7.1.8}$$


The quantile regression approximation theorem looks complicated,
but the big picture is simple. We can think of quantile
regression as approximating $Q_\tau(Y_i \mid X_i)$, just as OLS approximates
$E[Y_i \mid X_i]$. The OLS weighting function is the histogram
of $X_i$, denoted $P(X_i)$. The quantile regression weighting function,
implicitly given by $w_\tau(X_i, \beta_\tau) \cdot P(X_i)$, is more elaborate
than $P(X_i)$ alone (the histogram is implicitly part of the
quantile regression weighting function because the expectation
in (7.1.7) is over the distribution of $X_i$). The term
$w_\tau(X_i, \beta_\tau)$ involves the quantile regression vector, $\beta_\tau$, but can
be rewritten with $\beta_\tau$ partialed out so that it is a function
of $X_i$ only (see Angrist, Chernozhukov, and Fernandez-Val,
2006, for details). In any case, the quantile regression weights
are approximately proportional to the density of $Y_i$ in the
neighborhood of the CQF.

The quantile regression approximation property is illus- 
trated in figure 7.1.1, which plots the conditional quantile 
function of log wages given highest grade completed using 
1980 census data. Here we take advantage of the discrete- 
ness of schooling and large census samples to estimate the 
CQF nonparametrically by computing the quantile of wages 
for each schooling level. Panels A-C plot a nonparametric estimate
of $Q_\tau(Y_i \mid X_i)$ along with the linear quantile regression fit
for the 0.10, 0.50, and 0.90 quantiles, where $X_i$ includes only
the schooling variable and a constant. The nonparametric cell- 
by-cell estimate of the CQF is plotted with circles in the figure, 
while the quantile regression line is solid. The figure shows 
how linear quantile regression approximates the CQF. 

It’s also interesting to compare quantile regression to a 
histogram-weighted fit to the CQF, similar to that provided 
by OLS for the CEF. The histogram-weighted approach to 
quantile regression was proposed by Chamberlain (1994). The 
Chamberlain minimum distance (MD) estimator is the sample
analog of the vector $\tilde{\beta}_\tau$ obtained by solving

$$\tilde{\beta}_\tau = \arg\min_{b} E\left[(Q_\tau(Y_i \mid X_i) - X_i'b)^2\right] = \arg\min_{b} E\left[\Delta_\tau^2(X_i, b)\right].$$

In other words, $\tilde{\beta}_\tau$ is the slope of the linear regression of
$Q_\tau(Y_i \mid X_i)$ on $X_i$, weighted by the histogram of $X_i$. In contrast
to quantile regression, which requires only one pass through
the data, MD relies on the ability to estimate $Q_\tau(Y_i \mid X_i)$
consistently in a nonparametric first step. 
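
The two steps of the MD approach, as described here, can be sketched as follows. This follows the histogram-weighted description in the text rather than any particular implementation of Chamberlain (1994); the data are made up, the function names are ours, and the regressor is a single discrete variable plus a constant.

```python
import math
from collections import defaultdict


def sample_quantile(ys, tau):
    """Smallest observed y whose empirical CDF is at least tau."""
    s = sorted(ys)
    return s[max(math.ceil(tau * len(s)), 1) - 1]


def chamberlain_md(xs, ys, tau):
    """Two-step MD sketch: cell quantiles, then histogram-weighted least squares."""
    cells = defaultdict(list)
    for x, y in zip(xs, ys):
        cells[x].append(y)
    # Step 1: nonparametric CQF, one quantile per cell, with cell counts as weights.
    pts = [(x, sample_quantile(v, tau), len(v)) for x, v in sorted(cells.items())]
    n = sum(w for _, _, w in pts)
    xbar = sum(w * x for x, _, w in pts) / n
    qbar = sum(w * q for _, q, w in pts) / n
    # Step 2: weighted regression of the cell quantiles on x.
    sxx = sum(w * (x - xbar) ** 2 for x, _, w in pts)
    sxq = sum(w * (x - xbar) * (q - qbar) for x, q, w in pts)
    slope = sxq / sxx
    return qbar - slope * xbar, slope


# Hypothetical data with a convex conditional median: 8, 12, and 20.
xs = [8] * 3 + [12] * 3 + [16] * 3
ys = [7, 8, 9, 11, 12, 13, 19, 20, 21]
intercept, slope = chamberlain_md(xs, ys, 0.5)
print(intercept, slope)
```

Because the three cells are equally populated, the fitted line is simply the least squares line through the cell medians, with a slope of 1.5.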

Figure 7.1.1 plots MD fitted values with a dashed line. The
quantile regression and MD lines are close, but they are not
identical because of the implicit weighting by $w_\tau(X_i, \beta_\tau)$ in the
quantile regression fit. This weighting accentuates the quality
of the fit at values of $X_i$ where $Y_i$ is more densely distributed
near the CQF. Panels D-F in figure 7.1.1 plot the overall quantile
weights, $w_\tau(X_i, \beta_\tau) \cdot P(X_i)$, against $X_i$. The panels also
show estimates of $w_\tau(X_i, \beta_\tau)$, labeled “importance weights,”
and their density approximations, $1/2 \cdot f_y(Q_\tau(Y_i \mid X_i) \mid X_i)$. The
importance weights and the density weights are similar and
fairly flat. The overall weighting function looks a lot like the
schooling histogram, and therefore places the highest weight
on 12 and 16 years of schooling.

Figure 7.1.1 The quantile regression approximation property (adapted from Angrist, Chernozhukov, and
Fernandez-Val, 2006). The figure shows alternative estimates of the conditional quantile function of log wages given
highest grade completed using 1980 Census data, along with the implied weighting function. Panels A-C report
nonparametric (CQ), quantile regression (QR) and minimum distance (MD) estimates for $\tau = .1, .5, .9$. Panels D-F
show the corresponding weighting functions for QR, as explained in the text.


7.1.3 Tricky Points 


The language of conditional quantiles is tricky. Sometimes we 
talk about “quantile regression coefficients at the median,” 
or “effects on those at the lower decile.” But it’s important 
to remember that quantile coefficients tell us about effects on 
distributions, not on individuals. If we discover, for example, 
that a training program raises the lower decile of the wage dis- 
tribution, this does not necessarily mean that someone who 
would have been poor (i.e., at the lower decile without train- 
ing) is now less poor. It only means that those who are poor 
in the regime with training are less poor than the poor would 
be in a regime without training. 

The distinction between making a given set of poor people 
richer and changing what it means to be poor is subtle. This 
distinction has to do with whether we think an intervention 
preserves an individual’s rank in the wage (or other dependent 
variable) distribution. If an intervention is rank-preserving, 
then an increase in the lower decile indeed makes those who 
would have been poor richer, since rank preservation means 
relative status is unchanged. Otherwise, we can only say that 
the poor—defined as the group in the bottom 10 percent of 
the wage distribution, whoever they may be—are better off. 
We elaborate on this point briefly in section 7.2. 

A second tricky point is the transition from conditional 
quantiles to marginal quantiles. A link from conditional to 
marginal quantiles allows us to investigate the impact of 
changes in quantile regression coefficients on overall inequal- 
ity. Suppose, for example, that quantile coefficients fan out 
even further with schooling, beyond what is observed in the 
2000 census. What does this imply for the ratio of upper-decile
to lower-decile wages? Alternatively, we can ask: How much
of the overall increase in inequality (say, as measured by
the ratio of upper to lower deciles) is explained by increases
in within-group inequality summarized by the fanning out
of quantile regression coefficients? These sorts of questions
turn out to be surprisingly difficult to answer. The difficulty
has to do with the fact that all conditional quantiles are
needed to pin down a particular marginal quantile (Machado
and Mata, 2005). In particular, $Q_\tau(Y_i \mid X_i) = X_i'\beta_\tau$ does not
imply $Q_\tau(Y_i) = Q_\tau(X_i)'\beta_\tau$. This contrasts with the much more
tractable expectations operator, where, if $E(Y_i \mid X_i) = X_i'\beta$,
then by iterating expectations, we have $E(Y_i) = E(X_i)'\beta$.
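
A tiny made-up example shows the contrast. In the sketch below (hypothetical data, our helper names), conditional medians and conditional means are both exactly linear in a single dummy regressor, yet only the mean can be recovered by plugging in the corresponding summary of $X_i$.

```python
import math


def sample_quantile(ys, tau):
    """Smallest observed y whose empirical CDF is at least tau."""
    s = sorted(ys)
    return s[max(math.ceil(tau * len(s)), 1) - 1]


# Hypothetical population: two equal-sized groups, x = 0 and x = 1.
y0 = [0, 1, 2, 3, 4]        # outcomes when x = 0
y1 = [10, 11, 12, 13, 14]   # outcomes when x = 1
xs = [0] * 5 + [1] * 5
ys = y0 + y1

# Conditional medians are linear in x: Q(y|x) = 2 + 10x.
b0 = sample_quantile(y0, 0.5)
b1 = sample_quantile(y1, 0.5) - b0

plug_in_median = b0 + b1 * sample_quantile(xs, 0.5)   # Q(x)'beta
marginal_median = sample_quantile(ys, 0.5)            # Q(y)

# Means, by contrast, aggregate: E(y) = E(x)'beta with E(y|x) = 2 + 10x.
plug_in_mean = 2 + 10 * (sum(xs) / len(xs))
marginal_mean = sum(ys) / len(ys)
print(plug_in_median, marginal_median, plug_in_mean, marginal_mean)
```

The plug-in median is 2 while the marginal median is 4; the plug-in and marginal means both equal 7.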


Extracting Marginal Quantiles* 


To show the link between conditional quantiles and marginal
distributions more formally, suppose the CQF is indeed linear,
so that $Q_\tau(Y_i \mid X_i) = X_i'\beta_\tau$. Let $F_y(y \mid X_i) = P[Y_i < y \mid X_i]$ be the
conditional CDF of $Y_i$ given $X_i$, with marginal distribution
$F_y(y) = P[Y_i < y]$. The CDF and its inverse are related by

$$\int_0^1 1[F_y^{-1}(\tau \mid X_i) < y]\, d\tau = F_y(y \mid X_i), \tag{7.1.9}$$

where $F_y^{-1}(\tau \mid X_i)$ is also the CQF, $Q_\tau(Y_i \mid X_i)$.

In other words, the proportion of the population below $y$
conditional on $X_i$ is the same as the proportion of conditional
quantiles that are below $y$.⁵ Substituting for the CQF inside
the integral using the linear model gives

$$F_y(y \mid X_i) = \int_0^1 1[X_i'\beta_\tau < y]\, d\tau.$$

Next, we use the law of iterated expectations to get the
marginal distribution function, $F_y(y)$:

$$F_y(y) = E\left[\int_0^1 1[X_i'\beta_\tau < y]\, d\tau\right]. \tag{7.1.10}$$

Finally, marginal quantiles, say, $Q_\tau(Y_i)$ for $\tau \in (0,1)$, come
from inverting $F_y(y)$:

$$Q_\tau(Y_i) = \inf\{y : F_y(y) \ge \tau\}.$$


5For example, if $y$ is the conditional median, then $F_y(y \mid X_i) = .5$, and half
of all conditional quantiles are below $y$. The relation (7.1.9) can be proved
formally using the change of variables formula.


An estimator of the marginal distribution replaces the integral
and expectations with sums in (7.1.10), where the sum
over quantiles comes from quantile regression estimates at,
say, every .01 quantile. In a sample of size \(N\), this becomes:

\[
\hat F_Y(y) = N^{-1}\sum_{i}\,(1/100)\sum_{\tau=0}^{\tau=1} 1[X_i'\hat\beta_\tau < y].
\]

The corresponding marginal quantile estimator inverts \(\hat F_Y(y)\).
A number of difficulties arise with this approach in practice. 
For one thing, you have to estimate lots of quantile regres- 
sions. Another is that the asymptotic distribution theory is 
complicated (though not insurmountable; see Chernozhukov,
Fernandez-Val, and Melly, 2008). Simplifying the conditional 
to marginal quantile transition is an active research area. 
Gosling, Machin, and Meghir (2000) and Machado and Mata 
(2005) are among the first empirical studies to go from con- 
ditional to marginal quantiles. When the variable of primary 
interest in a quantile regression model is a dummy variable 
such as treatment status and the other regressors are seen as 
controls, a propensity score type of weighting scheme can be 
used to estimate effects on marginal distributions. See Firpo 
(2007) for the exogenous case and Frölich and Melly (2007) 
for a marginalization scheme that works for endogenous treat- 
ment effects models of the sort discussed in the next section. 
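
The sample version of (7.1.10) and its inversion can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are ours, the coefficient matrix `betas` is a hypothetical stand-in for quantile regression estimates on an equally spaced grid (here the midpoints .005, .015, ..., .995 rather than any particular grid used in the literature), and no inference is attempted.

```python
import numpy as np

def marginal_cdf(y, X, betas):
    """Estimate F_Y(y) by averaging 1[X_i'b_tau < y] over i and over tau.

    X is N x K; betas is T x K with one row of quantile regression
    coefficients per grid point (equal weights 1/T).
    """
    fitted = X @ betas.T              # N x T matrix of fitted conditional quantiles
    return float(np.mean(fitted < y))

def marginal_quantile(tau, X, betas):
    """Invert the estimated CDF using the pooled fitted conditional quantiles."""
    pooled = np.sort((X @ betas.T).ravel())
    k = max(int(np.ceil(tau * pooled.size)) - 1, 0)
    return float(pooled[k])

# Hypothetical inputs: a constant and a dummy regressor, with a pure
# location shift of 2 and uniform(0, 1) conditional quantiles.
taus = (np.arange(100) + 0.5) / 100
betas = np.column_stack([taus, np.full(100, 2.0)])
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

print(marginal_cdf(1.0, X, betas))        # share of fitted quantiles below 1
print(marginal_quantile(0.25, X, betas))  # estimated marginal lower quartile
```

In practice `betas` would come from running a quantile regression at each grid point, which is exactly the "lots of quantile regressions" burden noted above.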


7.2 IV Estimation of Quantile Treatment Effects 


The $42,000 question regarding any set of regression estimates 
is whether they have a causal interpretation. This is no less true 
for quantile regression than ordinary least squares. Suppose we 
are interested in estimating the effect of a training program on 
earnings. OLS regression estimates measure the impact of the 
program on average earnings while quantile regression estimates
can be used to measure the impact of the program on
median earnings. In both cases, we must worry about whether 
the estimated program effects are contaminated by omitted 
variables bias (OVB). 

Here too, omitted variables problems can be solved using 
instrumental variables, though IV methods for quantile mod- 
els are a relatively new development and not yet as flexible 
as conventional 2SLS. We discuss an approach that captures 
the causal effect of a binary variable on quantiles (i.e., a treat- 
ment effect) using a binary instrument. The quantile treatment 
effects (QTE) estimator for IV, introduced in Abadie, Angrist, 
and Imbens (2002), relies on essentially the same assumptions 
as the LATE framework for average causal effects. The result 
is an Abadie-type weighting estimator of the causal effect of 
treatment on quantiles for compliers.⁶

Our discussion of the QTE estimator is based on an addi- 
tive model for conditional quantiles, so that a single treatment 
effect is estimated in a model with covariates. The result- 
ing estimator simplifies to Koenker and Bassett (1978) linear 
quantile regression when there is no instrumenting. The rela- 
tionship between QTE and quantile regression is therefore 
analogous to that between conventional 2SLS and OLS when 
the regressor of interest is a dummy. 

The parameters of interest are defined as follows. For \(\tau \in (0,1)\),
we assume there exist \(\alpha_\tau\) and \(\beta_\tau\) such that

\[
Q_\tau(Y_i|X_i, D_i, D_{1i} > D_{0i}) = \alpha_\tau D_i + X_i'\beta_\tau, \tag{7.2.1}
\]

where \(Q_\tau(Y_i|X_i, D_i, D_{1i} > D_{0i})\) denotes the \(\tau\)-quantile of \(Y_i\)
given \(X_i\) and \(D_i\) for compliers. Thus, \(\alpha_\tau\) and \(\beta_\tau\) are quantile
regression coefficients for compliers.

Recall that \(D_i\) is independent of potential outcomes conditional
on \(X_i\) and \(D_{1i} > D_{0i}\), as we discussed in (4.5.2). The
parameter \(\alpha_\tau\) in this model therefore gives the difference in


6 For an alternative approach, see Chernozhukov and Hansen (2005), which
allows for regressors of any type (i.e., not just dummies) but invokes a rank-
similarity assumption that is unnecessary in the QTE framework.




the conditional-on-\(X_i\) quantiles of \(Y_{1i}\) and \(Y_{0i}\) for compliers.
In other words,

\[
Q_\tau(Y_{1i}|X_i, D_{1i} > D_{0i}) - Q_\tau(Y_{0i}|X_i, D_{1i} > D_{0i}) = \alpha_\tau. \tag{7.2.2}
\]


This tells us, for example, whether a training program changed
the conditional median or lower decile of earnings for compliers.
Note that the parameter \(\alpha_\tau\) does not tell us whether
treatment changed the quantiles of the unconditional distributions
of \(Y_{1i}\) and \(Y_{0i}\). For that, we have to integrate families
of quantile regression results using a procedure like the one
described in section 7.1.3.

It also bears emphasizing that \(\alpha_\tau\) is not the conditional
quantile of the individual treatment effects, \((Y_{1i} - Y_{0i})\). You
might want to know, for example, whether the median treatment
effect is positive. Unfortunately, questions like this
are very hard to answer without making strong assumptions
such as rank-invariance.⁷ Even a randomized trial with perfect
compliance fails to reveal the distribution of \((Y_{1i} - Y_{0i})\).
Although a difference in averages is the same as an average
difference, other features of the distribution of \(Y_{1i} - Y_{0i}\) are
hidden because we never get to see both \(Y_{1i}\) and \(Y_{0i}\) for any
one person. The good news for applied econometricians is
that differences in distributions are usually more important
than the distribution of treatment effects because comparisons
of social welfare typically require only the distributions
of \(Y_{1i}\) and \(Y_{0i}\), and not the distribution of their difference
(see, e.g., Atkinson, 1970). This point can be made without
reference to quantiles. When evaluating an employment
program, we are inclined to view the program favorably if
it increases overall employment rates. In other words, we
are happy if the average \(Y_{1i}\) is higher than the average \(Y_{0i}\).
The number of individuals who gain jobs (\(Y_{1i} - Y_{0i} = 1\)) or
lose jobs (\(Y_{1i} - Y_{0i} = -1\)) should be of secondary interest,
since a good program will necessarily have more gainers than
losers.


7 In this context, rank-invariance means \(Y_{1i}\) and \(Y_{0i}\) are related by an
invertible function. See, for example, Heckman, Smith, and Clements (1997).

7.2.1 The QTE Estimator


The QTE estimator is motivated by the observation that quan- 
tile regression coefficients for compliers can (theoretically) be 
estimated by running quantile regressions in the population of 
compliers. We cannot list the compliers in a given data set, but 
as in section 4.5.2, we can use the Abadie kappa theorem to 
find them. Specifically, 


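\[
\begin{aligned}
(\alpha_\tau, \beta_\tau) &= \arg\min_{a,b} E\{\rho_\tau(Y_i - aD_i - X_i'b)\,|\,D_{1i} > D_{0i}\} \\
&= \arg\min_{a,b} E\{\kappa_i\rho_\tau(Y_i - aD_i - X_i'b)\},
\end{aligned} \tag{7.2.3}
\]

where

\[
\kappa_i = 1 - \frac{D_i(1 - Z_i)}{1 - P(Z_i = 1|X_i)} - \frac{(1 - D_i)Z_i}{P(Z_i = 1|X_i)},
\]

as before. The QTE estimator is the sample analog of (7.2.3).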

A number of practical issues arise when implementing QTE.
First, \(\kappa_i\) must be estimated, and the sampling variance induced
by this first-step estimation should be reflected in the relevant
asymptotic distribution theory. Abadie, Angrist, and Imbens
(2002) derive the limiting distribution of the sample analog
of (7.2.3) when \(\kappa_i\) is estimated nonparametrically. In practice,
however, it is easier to bootstrap the whole procedure (i.e.,
beginning with the construction of estimated kappas) than to
use the asymptotic formulas.

Second, \(\kappa_i\) is negative when \(D_i \neq Z_i\). The kappa-weighted
quantile regression minimand is therefore nonconvex and,
unlike the regular quantile regression estimator, does not have
a linear programming representation. This problem can be
solved by minimizing

\[
E\{E[\kappa_i|Y_i, D_i, X_i]\,\rho_\tau(Y_i - aD_i - X_i'b)\} \tag{7.2.4}
\]

instead. This minimand is derived by iterating expectations
in (7.2.3). The practical difference between (7.2.3) and (7.2.4)
is that the term

\[
E[\kappa_i|Y_i, D_i, X_i] = P[D_{1i} > D_{0i}|Y_i, D_i, X_i]
\]

is a probability and therefore between zero and one.⁸ A further
simplification comes from the fact that

\[
E[\kappa_i|Y_i, D_i, X_i] = 1 - \frac{D_i(1 - E[Z_i|Y_i, D_i = 1, X_i])}{1 - P(Z_i = 1|X_i)}
- \frac{(1 - D_i)E[Z_i|Y_i, D_i = 0, X_i]}{P(Z_i = 1|X_i)}. \tag{7.2.5}
\]


Angrist (2001) used this to implement QTE with probit models
for \(E[Z_i|Y_i, D_i, X_i]\) estimated separately in the \(D_i = 0\) and
\(D_i = 1\) subsamples, constructing \(E[\kappa_i|Y_i, D_i, X_i]\) using (7.2.5),
and then trimming any of the resulting estimates of
\(E[\kappa_i|Y_i, D_i, X_i]\) that are outside the unit interval. The resulting
non-negative first-step estimates of \(E[\kappa_i|Y_i, D_i, X_i]\) can be
plugged in as weights using Stata’s qreg command to construct
weighted quantile regression estimates in a second step.⁹
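
The weight construction in (7.2.5), including the trimming step, is simple enough to sketch directly. The function and argument names below are ours and purely illustrative; the fitted probabilities are taken as given (in the procedure described above they would come from first-step probits), and the numbers in the example are made up.

```python
import numpy as np

def expected_kappa(d, ez_given_ydx, pz_given_x):
    """First-step QTE weights from (7.2.5), trimmed to the unit interval.

    d             : 0/1 treatment indicator D_i
    ez_given_ydx  : fitted E[Z_i | Y_i, D_i, X_i], estimated separately in
                    the D_i = 0 and D_i = 1 subsamples
    pz_given_x    : fitted P(Z_i = 1 | X_i)
    """
    d = np.asarray(d, dtype=float)
    ez = np.asarray(ez_given_ydx, dtype=float)
    pz = np.asarray(pz_given_x, dtype=float)
    kappa = 1.0 - d * (1.0 - ez) / (1.0 - pz) - (1.0 - d) * ez / pz
    # Trim: anything below zero is set to zero, anything above one to one.
    return np.clip(kappa, 0.0, 1.0)

# Hypothetical fitted values for four observations.
d = [1, 1, 0, 0]
ez = [0.9, 0.2, 0.1, 0.8]
pz = [0.5, 0.5, 0.5, 0.5]
weights = expected_kappa(d, ez, pz)
print(weights)
```

Because the check function is positively homogeneous, non-negative weights like these can also be applied by multiplying the outcome and every regressor (including the constant) by the weight and then running an unweighted quantile regression; this is one way to proceed when a routine does not accept weights directly.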


Estimates of the Effect of Training on the Quantiles 
of Trainee Earnings 


The Job Training Partnership Act was a large federal program 
that provided subsidized training to disadvantaged American
workers in the 1980s. JTPA services were delivered at
649 sites, also called Service Delivery Areas (SDAs), located 


8 The expectation of \(\kappa_i\) is a probability because \(\kappa_i\) “finds compliers.” A
formal statement of this result appears in Abadie, Angrist, and Imbens (2002;
lemma 3.2).

9 Step-by-step, it goes like this:

1. Probit \(Z_i\) on \(Y_i\) and \(X_i\) separately in the \(D_i = 0\) and \(D_i = 1\)
subsamples. Save these fitted values.

2. Probit \(Z_i\) on \(X_i\) in the whole sample. Save these fitted values.

3. Construct \(E[\kappa_i|Y_i, D_i, X_i]\) by plugging the two sets of fitted values
into (7.2.5). Set anything less than zero to zero and anything greater
than one to one.

4. Use these kappas to weight quantile regressions.

5. Bootstrap this whole procedure to construct standard errors.

throughout the country. The original study of the labor-market
impact of JTPA services was based on a sample of
men and women for whom continuous data on earnings (from 
either state unemployment insurance records or two follow-up 
surveys) were available for at least 30 months after random
assignment.¹⁰ There are 5,102 adult men with 30-month
earnings data in the sample. 

In our notation, \(Y_i\) is 30-month earnings, \(D_i\) indicates enrollment
for JTPA services, and \(Z_i\) indicates the randomly assigned
offer of JTPA services. A key feature of most social exper- 
iments, as with many randomized trials of new drugs and 
therapies, is that some participants decline the intervention 
being offered. In the JTPA, those offered services were not
compelled to participate in training. Consequently, although 
the offer of subsidized training was randomly assigned, only 
about 60 percent of those offered training actually received 
JTPA services. Treatment received is therefore partly self-selected
and likely to be correlated with potential outcomes.
On the other hand, as discussed in 4.4.3, the randomized offer 
of training provides a good instrument for training received, 
since the two are obviously correlated and the offer of treat- 
ment is independent of potential outcomes. Moreover, because 
of the very low percentage of individuals receiving JTPA ser- 
vices in the control group (less than 2 percent), effects for 
compliers can be interpreted as effects on those who were 
treated (as discussed in 4.4.3: LATE equals the effect on the 
treated when there are no always-takers). 

Since training offers were randomized in the National JTPA 
Study, covariates (\(X_i\)) are not required to consistently estimate
effects on compliers. Even in experiments like this, how- 
ever, it’s customary to control for covariates to correct for 
chance associations between treatment status and applicant 
characteristics and to increase precision (see chapter 2). The 
covariates used here are baseline measures from the JTPA
intake process. They include dummies for black and Hispanic 
applicants, a dummy for high school graduates (including 
GED holders), dummies for married applicants, five age-group 


10 See Bloom et al. (1997) and Orr et al. (1996).

dummies, and a dummy for whether the applicant worked at
least 13 weeks in the year preceding random assignment. Also 
included are dummies for the original recommended service 
strategy (classroom training, on-the-job training, job search 
assistance, other) and a dummy for whether earnings data 
are from the second follow-up survey. Since these covariates 
mostly summarize subjects’ demographic and socioeconomic 
background, we can think of the quantile analysis as telling us 
how the JTPA experiment affected the earnings distribution 
within demographic and socioeconomic groups. 

As a benchmark, OLS and conventional instrumental vari- 
ables (2SLS) estimates of the impact of training on adult men 
are reported in the first column of table 7.2.1. The OLS train- 
ing coefficient is a precisely estimated $3,754. This is the 
coefficient on \(D_i\) in a regression of \(Y_i\) on \(D_i\) and \(X_i\). These
estimates ignore the fact that trainees are self-selected. The
2SLS estimates in table 7.2.1 use the randomized offer of treatment
\(Z_i\) as an instrument for \(D_i\). The 2SLS estimate is $1,593
with a standard error of $895, less than half the size of the 
corresponding OLS estimate. 

Quantile regression estimates show that the gap in quantiles 
by trainee status is much larger (in proportionate terms) below 
the median than above it. This can be seen in the top right-hand 
columns of table 7.2.1, which reports quantile regression esti- 
mates for the .15, .25, .5, .75, and .85 quantiles. Specifically, 
the .85 quantile of trainee earnings is about 13 percent higher 
than the corresponding quantile for non-trainees, while the .15 
quantile is 136 percent higher. Like the OLS estimates in the 
table, these quantile regression coefficients do not necessarily 
have a causal interpretation. Rather, they provide a descrip- 
tive comparison of the earnings distributions of trainees and 
nontrainees. 

QTE estimates of the effect of training on median earn- 
ings are similar in magnitude though less precise than the 
benchmark 2SLS estimates. On the other hand, the QTE 
estimates exhibit a pattern very different from the quan- 
tile regression estimates. The estimates at low quantiles are 
substantially smaller than the corresponding quantile regres- 
sion estimates, and they are small in absolute terms. For 

TABLE 7.2.1
Quantile regression estimates and quantile treatment effects from the JTPA
experiment

A. OLS and Quantile Regression Estimates
                                            Quantile
Variable                 OLS        .15        .25        .50        .75        .85
Training effect          3,754      1,187      2,510      4,420      4,678      4,806
                         (536)      (205)      (356)      (651)      (937)      (1,055)
% Impact of training     21.2       135.6      75.2       34.5       17.2       13.4
High school or GED       4,015      339        1,280      3,665      6,045      6,224
                         (571)      (186)      (305)      (618)      (1,029)    (1,170)
Black                    -2,354     134        500        2,084      3,576      3,609
                         (626)      (194)      (324)      (684)      (1,087)    (1,331)
Hispanic                 251        91         278        925        -877       -85
                         (883)      (315)      (512)      (1,066)    (1,769)    (2,047)
Married                  6,546      587        1,964      7,113      10,073     11,062
                         (629)      (222)      (427)      (839)      (1,046)    (1,093)
Worked < 13 weeks        6,582      1,090      3,097      7,610      9,834      9,951
  in past year           (566)      (190)      (339)      (665)      (1,000)    (1,099)
Constant                 9,811      -216       365        6,110      14,874     21,527
                         (1,541)    (468)      (765)      (1,403)    (2,134)    (3,896)

B. 2SLS and QTE Estimates
                                            Quantile
Variable                 2SLS       .15        .25        .50        .75        .85
Training effect          1,593      121        702        1,544      3,131      3,378
                         (895)      (475)      (670)      (1,073)    (1,376)    (1,811)
% Impact of training     8.55       5.19       12.0       9.64       10.7       9.02
High school or GED       4,075      714        1,752      4,024      5,392      5,954
                         (573)      (429)      (644)      (940)      (1,441)    (1,783)
Black                    -2,349     171        377        2,656      4,182      3,523
                         (625)      (439)      (626)      (1,136)    (1,587)    (1,867)
Hispanic                 335        328        1,476      1,499      379        1,023
                         (888)      (757)      (1,128)    (1,390)    (2,294)    (2,427)
Married                  6,647      1,564      3,190      7,683      9,509      10,185
                         (627)      (596)      (865)      (1,202)    (1,430)    (1,525)
Worked < 13 weeks        6,575      1,932      4,195      7,009      9,289      9,078
  in past year           (567)      (442)      (664)      (1,040)    (1,420)    (1,596)
Constant                 10,641     -134       1,049      7,689      14,901     22,412
                         (1,569)    (1,116)    (1,655)    (2,361)    (3,292)    (7,655)


Notes: The table reports OLS, quantile regression, 2SLS, and QTE estimates of the 
effect of training on earnings (adapted from Abadie, Angrist, and Imbens, 2002). The 
sample includes 5,102 adult men. Assignment status is used as an instrument for training 
status in Panel B. In addition to the covariates shown in the table, all models include 
dummies for service strategy recommended and age group, and a dummy indicating data 
from a second follow-up survey. Robust standard errors are reported in parentheses. 

example, the QTE estimate of the effect on the .15 quantile
is $121, while the corresponding quantile regression estimate 
is $1,187. Similarly, the QTE estimate of the effect on the 
.25 quantile is $702, while the corresponding quantile regres- 
sion estimate is $2,510. Unlike the results at low quantiles, 
however, the QTE estimates of effects on earnings above 
the median are large and statistically significant (though still 
smaller than the corresponding quantile regression estimates). 

The result that JTPA training for adult men did not raise 
the lower quantiles of their earnings distribution is the most 
interesting finding arising from this analysis. This suggests that 
the quantile regression estimates in the top half of table 7.2.1 
are contaminated by positive selection bias. One response 
to this finding might be that few JTPA applicants were very 
well off, so that distributional effects within applicants are of 
less concern than the fact that the program appears to have 
helped many applicants overall. However, the upper quantiles 
of earnings were reasonably high for adults who participated 
in the National JTPA Study. Increasing the upper tail of the 
trainee earnings distribution is therefore unlikely to have been 
a high priority for policy makers. 

Nonstandard Standard Error Issues


We have normality. I repeat, we have normality. 
Anything you still can’t cope with is therefore your own 
problem. 

Douglas Adams, The Hitchhiker’s Guide to the Galaxy 


Today, software packages routinely compute asymptotic
standard errors derived under weak assumptions about
the sampling process or underlying model. For example,
you get regression standard errors based on formula (3.1.7)
using the Stata option robust. Robust standard errors
improve on old-fashioned standard errors because the result- 
ing inferences are asymptotically valid when the regression 
residuals are heteroskedastic, as they almost certainly are when 
regression approximates a nonlinear conditional expectation 
function (CEF). In contrast, old-fashioned standard errors are 
derived assuming homoskedasticity. The hangup here is that 
estimates of robust standard errors can be misleading when 
the asymptotic approximation that justifies these estimates is 
not very good. The first part of this chapter looks at the failure 
of asymptotic inference with robust standard error estimates 
and some simple palliatives. 

A pillar of traditional cross section inference—and the dis- 
cussion in section 3.1.3—is the assumption that the data are 
independent. Each observation is treated as a random draw 
from the same population, uncorrelated with the observa- 
tion before or after. We understand today that this sampling 
model is unrealistic and potentially even foolhardy. Much 
as in the time series studies common in macroeconomics, 
cross section analysts must worry about correlation between 
observations. The most important form of dependence arises 
in data with a group structure, for example, the test scores
of children observed within classes or schools. Children in the
same school or class tend to have test scores that are corre- 
lated, since they are subject to some of the same environmental 
and family background influences. We call this correlation the 
clustering problem, or the Moulton problem, after Moulton 
(1986), who made it famous. A closely related problem is 
correlation over time in the data sets commonly used to imple- 
ment differences-in-differences (DD) estimation strategies. For 
example, studies of state-level minimum wages must confront 
the fact that state average employment rates are correlated over 
time. We call this the serial correlation problem, to distinguish 
it from the Moulton problem. 

Researchers plagued by clustering and serial correlation also 
have to confront the fact that the simplest fixups for these 
problems, like Stata’s cluster option, may not be very good. 
The asymptotic approximation relevant for clustered or seri- 
ally correlated data relies on a large number of clusters or time 
series observations. Alas, we are not always blessed with many 
clusters or long time series. The resulting inference problems 
are not always insurmountable, though often the best solu- 
tion is to get more data. Econometric fixups for clustering 
and serial correlation are discussed in the second part of this 
chapter. Some of the material in this chapter is hard to work 
through without matrix algebra, so we take the plunge and 
switch to a mostly matrix motif. 


8.1 The Bias of Robust Standard Error Estimates* 


In matrix notation

\[
\hat\beta = \left[\sum_i X_iX_i'\right]^{-1}\sum_i X_iy_i = (X'X)^{-1}X'y,
\]

where \(X\) is the \(N \times K\) matrix with rows \(X_i'\) and \(y\) is the
\(N \times 1\) vector of \(y_i\)’s. We saw in section 3.1.3 that \(\hat\beta\) has an
asymptotically normal distribution. We can write:

\[
\sqrt{N}(\hat\beta - \beta) \sim N(0, \Omega),
\]

where \(\Omega\) is the asymptotic covariance matrix and \(\beta =
E[X_iX_i']^{-1}E[X_iy_i]\). Repeating (3.1.7), the formula for \(\Omega\) in this
case is

\[
\Omega_r = E[X_iX_i']^{-1}E[X_iX_i'e_i^2]E[X_iX_i']^{-1}, \tag{8.1.1}
\]

where \(e_i = y_i - X_i'\beta\). When residuals are homoskedastic, the
covariance matrix simplifies to \(\Omega_c = \sigma^2E[X_iX_i']^{-1}\), where
\(\sigma^2 = E[e_i^2]\).

We are concerned here with the bias of robust standard error 
estimates in independent samples (i.e., no clustering or serial 
correlation). To simplify the derivation of bias, we assume 
that the regressor vector can be treated as fixed, as it would
be if we sampled stratifying on \(X_i\). Nonstochastic regressors
give a benchmark sampling model that is often used to look
at finite-sample distributions. It turns out that we miss little 
of theoretical importance by making this assumption, while 
simplifying the derivations considerably. 

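With fixed regressors, we have

\[
\Omega_r = \left(\frac{X'X}{N}\right)^{-1}\left(\frac{X'\Psi X}{N}\right)\left(\frac{X'X}{N}\right)^{-1}, \tag{8.1.2}
\]

where

\[
\Psi = E[ee'] = \mathrm{diag}(\psi_i)
\]

is the covariance matrix of residuals. Under homoskedasticity,
\(\psi_i = \sigma^2\) for all \(i\) and we get

\[
\Omega_c = \sigma^2\left(\frac{X'X}{N}\right)^{-1}.
\]

Asymptotic standard errors are given by the square root of the
diagonal elements of \(\Omega_r\) and \(\Omega_c\), after removing the asymptotic
normalization by dividing by \(N\).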

In practice, the pieces of the asymptotic covariance matrix
are estimated using sample moments. An old-fashioned or
conventional covariance matrix estimator is

\[
\hat\Omega_c = (X'X)^{-1}\hat\sigma^2 = (X'X)^{-1}\left(\frac{\sum \hat e_i^2}{N}\right),
\]

where \(\hat e_i = y_i - X_i'\hat\beta\) is the estimated regression residual and

\[
\hat\sigma^2 = \frac{\sum \hat e_i^2}{N}
\]

estimates the residual variance. The corresponding robust
covariance matrix estimator is

\[
\hat\Omega_r = N(X'X)^{-1}\left(\frac{\sum X_iX_i'\hat e_i^2}{N}\right)(X'X)^{-1}. \tag{8.1.3}
\]

We can think of the middle term as an estimator of the form
\(\frac{\sum X_iX_i'\hat\psi_i}{N}\), where \(\hat\psi_i = \hat e_i^2\) estimates \(\psi_i\).
By the law of large numbers and Slutsky’s theorem, \(N\hat\Omega_c\)
converges in probability to \(\Omega_c\), while \(N\hat\Omega_r\) converges to \(\Omega_r\).
But in finite samples, both variance estimators are biased. The
bias in \(\hat\Omega_c\) is well-known from classical least squares theory and
easy to correct. Less appreciated is the fact that if the residuals
are homoskedastic, the robust estimator is more biased
than the conventional estimator, perhaps a lot more. From
this we conclude that robust standard errors can be more misleading
than conventional standard errors in situations where
heteroskedasticity is modest. We also propose a rule of thumb
that uses the maximum of old-fashioned and robust standard
errors to avoid gross misjudgments of precision.

Our analysis begins with the bias of \(\hat\Omega_c\). With nonstochastic
regressors, we have

\[
E[\hat\Omega_c] = (X'X)^{-1}E[\hat\sigma^2] = (X'X)^{-1}\left(\frac{\sum E(\hat e_i^2)}{N}\right).
\]


To analyze \(E[\hat e_i^2]\), start by expanding \(\hat e = y - X\hat\beta\):

\[
\hat e = y - X(X'X)^{-1}X'y = [I_N - X(X'X)^{-1}X'](X\beta + e) = Me,
\]

where \(e\) is the vector of population residuals, \(M = I_N -
X(X'X)^{-1}X'\) is a nonstochastic residual-maker matrix with
\(i\)th row \(m_i'\), and \(I_N\) is the \(N \times N\) identity matrix. Then \(\hat e_i = m_i'e\),
and

\[
E(\hat e_i^2) = E(m_i'ee'm_i) = m_i'\Psi m_i.
\]

To simplify further, write \(m_i = \ell_i - h_i\), where \(\ell_i\) is the \(i\)th
column of \(I_N\) and \(h_i = X(X'X)^{-1}X_i\), the \(i\)th column of the
projection matrix \(H = X(X'X)^{-1}X'\). Then

\[
\begin{aligned}
E(\hat e_i^2) &= (\ell_i - h_i)'\Psi(\ell_i - h_i) \\
&= \psi_i - 2\psi_ih_{ii} + h_i'\Psi h_i,
\end{aligned} \tag{8.1.4}
\]

where \(h_{ii}\), the \(i\)th diagonal element of \(H\), satisfies

\[
h_{ii} = h_i'h_i = X_i'(X'X)^{-1}X_i. \tag{8.1.5}
\]

Parenthetically, \(h_{ii}\) is called the leverage of the \(i\)th observation.
Leverage tells us how much pull a particular value of \(X_i\)
exerts on the regression line. Note that the \(i\)th fitted value (\(i\)th
element of \(Hy\)) is

\[
\hat y_i = h_i'y = h_{ii}y_i + \sum_{j \neq i} h_{ij}y_j. \tag{8.1.6}
\]

A large \(h_{ii}\) means that the \(i\)th observation has a large impact on
the \(i\)th predicted value. In a bivariate regression with a single
regressor, \(x_i\),

\[
h_{ii} = \frac{1}{N} + \frac{(x_i - \bar x)^2}{\sum_j (x_j - \bar x)^2}. \tag{8.1.7}
\]

This shows that leverage increases when \(x_i\) is far from the mean. In
addition to (8.1.6), we know that \(h_{ii}\) is a number that lies in
the interval [0, 1] and that \(\sum_{i=1}^{N} h_{ii} = K\), the number of regressors
(see, e.g., Hoaglin and Welsch, 1978).¹
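
These leverage properties are easy to verify numerically. The sketch below uses a made-up regressor with one point far from the mean; the function name is ours. It computes the diagonal of \(H\) from (8.1.5), compares it with the bivariate formula (8.1.7), and checks that the leverages sum to the number of regressors.

```python
import numpy as np

def leverage(X):
    """Diagonal of H = X (X'X)^{-1} X', computed one observation at a time."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # h_ii = X_i' (X'X)^{-1} X_i for each row i
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

# Hypothetical regressor: the last observation sits far from the mean.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])   # constant plus one regressor, so K = 2

h = leverage(X)
h_formula = 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print("leverage:        ", h)
print("formula (8.1.7): ", h_formula)
print("sum of leverages:", h.sum())
```

The outlying observation carries most of the leverage, which is what makes its residual, and hence its contribution to the robust variance estimate, unusually small.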


1 The property \(\sum_{i=1}^{N} h_{ii} = K\) comes from the fact that \(H\) is idempotent, and so
has trace equal to rank. We can also use (8.1.7) to verify that in a bivariate
regression, \(\sum_{i=1}^{N} h_{ii} = 2\).

Suppose residuals are homoskedastic, so that \(\psi_i = \sigma^2\). Then
(8.1.4) simplifies to

\[
E(\hat e_i^2) = \sigma^2[1 - 2h_{ii} + h_i'h_i] = \sigma^2(1 - h_{ii}) < \sigma^2.
\]

So \(\hat\Omega_c\) tends to be too small. Using the properties of \(h_{ii}\), we can
go one step further:

\[
\frac{\sum E(\hat e_i^2)}{N} = \sigma^2\,\frac{\sum (1 - h_{ii})}{N} = \sigma^2\left(\frac{N - K}{N}\right).
\]

Thus, the bias in \(\hat\Omega_c\) can be fixed by a simple degrees-of-freedom
correction: divide by \(N - K\) instead of \(N\) in the
formula for \(\hat\sigma^2\). This correction is used by default in most
regression software.

We now want to show that under homoskedasticity, the bias
in \(\hat\Omega_r\) is likely to be worse than the bias in \(\hat\Omega_c\). The expected
value of the robust covariance matrix estimator is

\[
E[\hat\Omega_r] = N(X'X)^{-1}\left(\frac{\sum X_iX_i'E(\hat e_i^2)}{N}\right)(X'X)^{-1}, \tag{8.1.8}
\]

where \(E(\hat e_i^2)\) is given by (8.1.4). Under homoskedasticity,
\(\psi_i = \sigma^2\) and we have \(E(\hat e_i^2) = \sigma^2(1 - h_{ii})\) as in \(\hat\Omega_c\). It’s clear,
therefore, that the bias in \(\hat e_i^2\) tends to pull robust standard
errors down. The general expression, (8.1.8), is hard to evaluate,
however. Chesher and Jewitt (1987) show that as long as
there is not “too much” heteroskedasticity, robust standard
errors based on \(\hat\Omega_r\) are indeed biased downward.²

2 In particular, as long as the ratio of the largest \(\psi_i\) to the smallest \(\psi_i\) is less
than 2, robust standard errors are biased downward.

How do we know that \(\hat\Omega_r\) is likely to be more biased
than \(\hat\Omega_c\)? Partly this comes from Monte Carlo evidence (e.g.,
MacKinnon and White, 1985, and our own small study, discussed
below). We also prove this here for a bivariate example,
where the single regressor, \(\tilde x_i\), is assumed to be in
deviations-from-means form, so there is a single coefficient. In this case,
the estimator of interest is \(\hat\beta_1 = \frac{\sum \tilde x_iy_i}{\sum \tilde x_i^2}\) and the leverage is
\(h_{ii} = \frac{\tilde x_i^2}{\sum \tilde x_j^2}\) (we lose the \(\frac{1}{N}\) term in (8.1.7) by partialing out
the constant). Let \(s_x^2 = \frac{\sum \tilde x_i^2}{N}\). For the conventional covariance
estimator, we have

\[
E[\hat\Omega_c] = \frac{\sigma^2}{Ns_x^2}\left[\frac{\sum (1 - h_{ii})}{N}\right] = \frac{\sigma^2}{Ns_x^2}\left[1 - \frac{1}{N}\right],
\]

so the bias here is small. A simple calculation using (8.1.8)
shows that under homoskedasticity, the robust estimator has
expectation:

\[
\begin{aligned}
E[\hat\Omega_r] &= \frac{\sigma^2}{Ns_x^2}\sum (1 - h_{ii})\frac{\tilde x_i^2}{Ns_x^2} \\
&= \frac{\sigma^2}{Ns_x^2}\sum (1 - h_{ii})h_{ii} = \frac{\sigma^2}{Ns_x^2}\left[1 - \sum h_{ii}^2\right].
\end{aligned}
\]

The bias of \(\hat\Omega_r\) is therefore worse than the bias of \(\hat\Omega_c\) if
\(\sum h_{ii}^2 > \frac{1}{N}\), as it is by Jensen’s inequality unless the regressor
has constant leverage, in which case \(h_{ii} = \frac{1}{N}\) for all \(i\).³
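
The bivariate comparison above involves no randomness once the regressor is fixed, so the two expectations can be evaluated exactly. The sketch below does this for a made-up regressor; the function name and the numbers are ours, and homoskedasticity with \(\sigma^2 = 1\) is assumed.

```python
import numpy as np

def expected_variance_estimates(x, sigma2=1.0):
    """Exact expectations of the conventional and robust variance estimators
    for the slope in the bivariate, homoskedastic, fixed-regressor example."""
    x_tilde = x - x.mean()                 # deviations-from-means form
    N = x.size
    ssx = np.sum(x_tilde ** 2)             # equals N * s_x^2
    h = x_tilde ** 2 / ssx                 # leverage after partialing out the constant
    true_var = sigma2 / ssx                # sampling variance of the slope
    e_conventional = true_var * (1 - 1 / N)
    e_robust = true_var * (1 - np.sum(h ** 2))
    return true_var, e_conventional, e_robust

# Hypothetical regressor with one high-leverage point.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
true_var, e_conv, e_rob = expected_variance_estimates(x)
print("true variance:           ", true_var)
print("E[conventional estimate]:", e_conv)
print("E[robust estimate]:      ", e_rob)

# Constant-leverage case: the squared deviations are all equal.
_, e_conv_flat, e_rob_flat = expected_variance_estimates(np.array([-1.0, 1.0, -1.0, 1.0]))
print("constant leverage:", e_conv_flat, e_rob_flat)
```

With unequal leverage the robust estimator's expectation falls further below the truth than the conventional one; with constant leverage the two coincide, as the Jensen's inequality argument implies.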


We can reduce the bias in \(\hat\Omega_r\) by trying to get a better estimator
of \(\psi_i\), say \(\hat\psi_i\). The estimator \(\hat\Omega_r\) sets \(\hat\psi_i = \hat e_i^2\) as proposed by
White (1980a) and our starting point in this section. The residual
variance estimators discussed in MacKinnon and White
(1985) include this and three others:

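\[
\begin{aligned}
HC_0 &: \hat\psi_i = \hat e_i^2 \\
HC_1 &: \hat\psi_i = \frac{N}{N - K}\,\hat e_i^2 \\
HC_2 &: \hat\psi_i = \frac{1}{1 - h_{ii}}\,\hat e_i^2 \\
HC_3 &: \hat\psi_i = \frac{1}{(1 - h_{ii})^2}\,\hat e_i^2
\end{aligned}
\]

3 Think of \(h_{ii}\) as a random variable with a uniform distribution in the sample.
Then

\[
E[h_{ii}] = \frac{\sum h_{ii}}{N} = \frac{1}{N},
\]

and

\[
E[h_{ii}^2] = \frac{\sum h_{ii}^2}{N} > (E[h_{ii}])^2 = \left(\frac{1}{N}\right)^2
\]

by Jensen’s inequality unless \(h_{ii}\) is constant. Therefore \(\sum h_{ii}^2 > \frac{1}{N}\). The
constant leverage case occurs when \((\tilde x_i)^2\) is constant.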


\(HC_1\) is a simple degrees of freedom correction as is used for \(\hat\Omega_c\).
\(HC_2\) uses the leverage to give an unbiased estimate of the variance
of the \(i\)th residual when the residuals are homoskedastic,
while \(HC_3\) approximates a jackknife estimator.⁴ In the applications
we’ve seen, the estimated standard errors tend to get
larger as we go down the list from \(HC_0\) to \(HC_3\), but this is not
a theorem.
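
The four residual variance estimators differ only in how they rescale the squared residuals, so they can share one sandwich formula. The following is a minimal sketch under our own naming, run on simulated data with a made-up design; it is not meant to reproduce the output of any particular package.

```python
import numpy as np

def hc_standard_errors(X, y):
    """Standard errors based on the HC0 to HC3 choices of psi_hat."""
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e2 = (y - X @ beta_hat) ** 2
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # leverage h_ii
    psi = {
        "HC0": e2,
        "HC1": e2 * N / (N - K),
        "HC2": e2 / (1 - h),
        "HC3": e2 / (1 - h) ** 2,
    }
    out = {}
    for name, p in psi.items():
        # (X'X)^{-1} [sum of X_i X_i' psi_i] (X'X)^{-1}
        cov = XtX_inv @ (X.T * p) @ X @ XtX_inv
        out[name] = np.sqrt(np.diag(cov))
    return out

# Simulated data (hypothetical): small sample, homoskedastic errors.
rng = np.random.default_rng(1)
x = rng.normal(size=30)
X = np.column_stack([np.ones(30), x])
y = 0.5 + 1.5 * x + rng.normal(size=30)

se = hc_standard_errors(X, y)
for name, values in se.items():
    print(name, values)
```

Because \(1/(1 - h_{ii}) \geq 1\), the \(HC_2\) and \(HC_3\) standard errors computed this way can never be smaller than \(HC_0\); where \(HC_1\) falls relative to \(HC_2\) depends on the design, consistent with the remark that the ordering is not a theorem.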


Time-Out for the Bootstrap 


Bootstrapping is a resampling scheme that offers an alterna- 
tive to inference based on asymptotic formulas. A bootstrap 
sample is a sample drawn from our own data. In other words, 
if we have a sample of size N, we treat this sample as if it 
were the population and draw repeatedly from it (with replace- 
ment). The bootstrap sampling distribution is the distribution 
of an estimator across many draws of this sort. Intuitively, 
we expect the sampling distribution constructed by sampling 
from our own data to provide a good approximation to the 
sampling distribution we are after. 

4 A jackknife variance estimator estimates sampling variance from the
empirical distribution generated by omitting one observation at a time. Stata
computes \(HC_1\), \(HC_2\), and \(HC_3\). You can also use a trick suggested by Messer
and White (1984): divide \(y_i\) and \(X_i\) by \(\sqrt{\hat\psi_i}\) and instrument the transformed
model by \(X_i\sqrt{\hat\psi_i}\) for your preferred choice of \(\hat\psi_i\).

There are many ways to bootstrap regression estimates. The
simplest is to draw pairs of \(\{y_i, X_i\}\) values, sometimes called
the “pairs bootstrap” or a “nonparametric bootstrap.” Alternatively,
we can keep the \(X_i\) values fixed, draw from the
distribution of residuals (\(\hat e_i\)), and create a new estimate of the
dependent variable based on the predicted value and the residual
draw for each observation. This procedure, which is a type
of parametric bootstrap, mimics a sample drawn with nonstochastic
regressors and ensures that \(X_i\) and the regression
residuals are independent. On the other hand, we don’t want
independence if we’re interested in standard errors under
heteroskedasticity. An alternative residual bootstrap, called the
wild bootstrap, draws \(X_i'\hat\beta + \hat e_i\) (which, of course, is just the
original \(y_i\)) with probability 0.5, and \(X_i'\hat\beta - \hat e_i\) otherwise (see,
e.g., Mammen, 1993, and Horowitz, 1997). This preserves
the relationship between residual variances and \(X_i\) observed
in the original sample, while imposing mean-independence of
residuals and regressors, a restriction that improves bootstrap
inference when true.
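
The three resampling schemes just described differ only in how each bootstrap sample is built. The sketch below puts them side by side; the function names, the number of replications, and the simulated data are all our own illustrative choices, and the bootstrap standard error is taken to be the standard deviation of the estimates across draws.

```python
import numpy as np

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

def bootstrap_se(X, y, scheme="pairs", reps=500, seed=0):
    """Bootstrap standard errors for OLS coefficients under one of three schemes."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    beta_hat = ols(X, y)
    fitted = X @ beta_hat
    resid = y - fitted
    draws = np.empty((reps, X.shape[1]))
    for r in range(reps):
        if scheme == "pairs":
            # Resample (y_i, X_i) pairs with replacement.
            idx = rng.integers(0, N, size=N)
            draws[r] = ols(X[idx], y[idx])
        elif scheme == "residual":
            # Keep X fixed; attach residuals drawn at random to fitted values.
            draws[r] = ols(X, fitted + rng.choice(resid, size=N, replace=True))
        elif scheme == "wild":
            # Keep X fixed; flip the sign of each own residual with probability 0.5.
            signs = rng.choice([-1.0, 1.0], size=N)
            draws[r] = ols(X, fitted + signs * resid)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return draws.std(axis=0, ddof=1)

# Simulated heteroskedastic data (hypothetical design).
rng = np.random.default_rng(2)
x = rng.normal(size=200)
X = np.column_stack([np.ones(200), x])
y = 1.0 + 2.0 * x + rng.normal(size=200) * (0.5 + np.abs(x))

for scheme in ("pairs", "residual", "wild"):
    print(scheme, bootstrap_se(X, y, scheme=scheme))
```

The residual scheme breaks any link between residual variance and \(X_i\), while the pairs and wild schemes keep it, which is why the latter two are the natural choices when heteroskedasticity is a concern.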

Bootstrapping is useful as a computer-intensive but other- 
wise straightforward calculator for asymptotic standard 
errors. The bootstrap calculator is especially useful when the 
asymptotic distribution of an estimator is hard to compute 
or involves a number of steps (e.g., the asymptotic distribu- 
tions of the quantile regression and quantile treatment effects 
estimates discussed in chapter 7 require the estimation of den- 
sities). Typically, however, we have no problem deriving or 
evaluating asymptotic formulas for the standard errors of OLS 
estimates. 

More relevant in this context is the use of the bootstrap 
to improve inference. Improvements in inference potentially 
come in two forms: (1) a reduction in finite-sample bias in esti- 
mators that are consistent (for example, the bias in estimates 
of robust standard errors) and (2) inference procedures which 
make use of the fact that the bootstrap sampling distribution 
of test statistics may be closer to the finite-sample distribu- 
tion of interest than the relevant asymptotic approximation. 
These two properties are called asymptotic refinements (see, 
e.g., Horowitz, 2001). 

Here we are mostly interested in use of the bootstrap for
asymptotic refinement. The asymptotic distribution of regression
estimates is easy enough to compute, but we worry that
the traditional robust covariance estimator (HC0) is biased.
The bootstrap can be used to estimate this bias, and then, by a
simple transformation, to construct standard error estimates
that are less biased. However, for now at least, bootstrap bias
correction of regression standard errors is not often used in
empirical practice, perhaps because the bias calculation is not
automated and perhaps because bootstrap bias corrections
introduce extra variability. Also, for simple estimators like
regression coefficients, analytic bias corrections such as HC2
and HC3 are readily available (e.g., in Stata).

An asymptotic refinement can also be obtained for hypoth- 
esis tests (and confidence intervals) based on statistics that 
are asymptotically pivotal. These are statistics that have 
asymptotic distributions that do not depend on any unknown 
parameters. An example is a t-statistic: this is asymptoti- 
cally standard normal. Regression coefficients are not asymp- 
totically pivotal; they have an asymptotic distribution that 
depends on the unknown residual variance. To refine infer- 
ence for regression coefficients, you calculate the t-statistic in 
each bootstrap sample and compare the analogous t-statistic 
from your original sample to this bootstrap “t-distribution.” 
A hypothesis is rejected if the absolute value of the original t- 
statistic is above, say, the 95th percentile of the absolute values 
from the bootstrap distribution. 
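
A minimal sketch of this bootstrap-t procedure follows. Several choices here are assumptions made for illustration rather than anything the text prescribes: observations are resampled in pairs, standard errors are HC0, each bootstrap t-statistic is centered at the original slope estimate, and the function names are hypothetical.

```python
import math
import random


def slope_and_robust_se(x, y):
    """Bivariate OLS slope and its HC0 (White) standard error."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    meat = sum(((xi - xbar) * (yi - b0 - b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b1, math.sqrt(meat) / sxx


def bootstrap_t_test(x, y, null=0.0, reps=999, level=0.05, seed=0):
    """Compare |t| from the original sample with the bootstrap |t| distribution."""
    rng = random.Random(seed)
    n = len(x)
    b1, se = slope_and_robust_se(x, y)
    t_orig = (b1 - null) / se
    abs_t = []
    while len(abs_t) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if max(xs) == min(xs):
            continue  # degenerate draw: the slope is not identified
        b1_star, se_star = slope_and_robust_se(xs, ys)
        if se_star == 0.0:
            continue
        # Center at the original estimate, which plays the role of the truth
        # in the bootstrap world.
        abs_t.append(abs((b1_star - b1) / se_star))
    abs_t.sort()
    crit = abs_t[math.ceil((1.0 - level) * reps) - 1]
    return t_orig, crit, abs(t_orig) > crit
```

The function returns the original t-statistic, the bootstrap critical value, and the rejection decision, so the critical value can be compared with the usual 1.96.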

Theoretical appeal notwithstanding, as applied researchers, 
we don’t like the idea of bootstrapping pivotal statistics very 
much. This is partly because we’re not only (or even primarily) 
interested in formal hypothesis testing: we like to see the stan- 
dard errors in parentheses under our regression coefficients. 
These provide a summary measure of precision that can be 
used to construct confidence intervals, compare estimators, 
and test any hypothesis that strikes us, now or later. In our 
view, therefore, practitioners worried about the finite-sample 
behavior of robust standard errors should focus on bias
corrections like HC2 and HC3. As we show below, for moderate 
heteroskedasticity at least, an inference strategy that uses the 
larger of conventional and bias-corrected standard errors often 
seems to give us the best of both worlds: reduced bias with a 
minimal loss of precision. 


An Example 


For further insight into the differences between robust covari- 
ance estimators, we analyze a simple but important example 
that has featured in earlier chapters in this book. Suppose you
are interested in an estimate of $\beta_1$ in the model

$$Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i, \tag{8.1.9}$$

where $D_i$ is a dummy variable. The OLS estimate of $\beta_1$ is the
difference in means between those with $D_i$ switched on and off.
Denoting these subsamples by the subscripts 1 and 0, we have

$$\hat\beta_1 = \bar Y_1 - \bar Y_0.$$

For the purposes of this derivation we think of $D_i$ as nonrandom,
so that $\sum_i D_i = N_1$ and $\sum_i (1 - D_i) = N_0$ are fixed. Let
$r = N_1/N$.

We know something about the finite-sample behavior of $\hat\beta_1$
from statistical theory. If $Y_i$ is normal with equal but unknown
variance in both the $D_i = 1$ and $D_i = 0$ populations, then the
conventional t-statistic for $\hat\beta_1$ has a t-distribution. This is the
classic two-sample t-test. Heteroskedasticity in this context
means that the variances in the $D_i = 1$ and $D_i = 0$ populations
are different. In this case, the testing problem in small
samples becomes surprisingly difficult: the exact small-sample
distribution for even this simple problem is unknown.5 The
robust variance estimators HC0 through HC3 give asymptotic
approximations to the unknown finite-sample distribution for the case
of unequal variances.

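The differences between HC0, HC1, HC2, and HC3 are differences
in how the sample variances in the two groups defined
by $D_i$ are processed. Define $S_j^2 = \sum_{D_i = j} (Y_i - \bar Y_j)^2$ for $j = 0, 1$.
The leverage in this example is

$$h_{ii} = \begin{cases} 1/N_0 & \text{if } D_i = 0 \\ 1/N_1 & \text{if } D_i = 1. \end{cases}$$

Using this, it’s straightforward to show that the five variance
estimators we’ve been discussing are

$$
\begin{aligned}
\text{Conventional:} \quad & \frac{N}{N_0 N_1}\left(\frac{S_0^2 + S_1^2}{N - 2}\right) = \frac{1}{N r (1 - r)}\left(\frac{S_0^2 + S_1^2}{N - 2}\right) \\
\text{HC0:} \quad & \frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2} \\
\text{HC1:} \quad & \frac{N}{N - 2}\left(\frac{S_0^2}{N_0^2} + \frac{S_1^2}{N_1^2}\right) \\
\text{HC2:} \quad & \frac{S_0^2}{N_0 (N_0 - 1)} + \frac{S_1^2}{N_1 (N_1 - 1)} \\
\text{HC3:} \quad & \frac{S_0^2}{(N_0 - 1)^2} + \frac{S_1^2}{(N_1 - 1)^2}
\end{aligned}
$$

5This is called the Behrens-Fisher problem (see, e.g., DeGroot and
Schervish, 2001, chap. 8).
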

The conventional estimator pools subsamples: this is efficient
when the two variances are the same. The White (1980a)
estimator, HC0, adds separate estimates of the sampling variances
of the means, using the consistent (but biased) variance
estimators, $S_j^2/N_j$. The HC2 estimator uses unbiased estimators
of the sample variance for each group, since it makes the
correct degrees-of-freedom correction. HC1 makes a
degrees-of-freedom correction outside the sum, which will help but is
generally not quite correct. Since we know HC2 to be the unbiased
estimate of the sampling variance under homoskedasticity,
HC3 must be too big.6 Note that with $r = 0.5$, a case where
the regression design is said to be balanced, the conventional
estimator equals HC1 and all five estimators differ little.
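
The two-group formulas above translate directly into code. The sketch below is illustrative only: the function name `dummy_regression_variances` and the dictionary keys are hypothetical, and each group is assumed to contain at least two observations so that the HC2 and HC3 denominators are nonzero.

```python
def dummy_regression_variances(y, d):
    """Five variance estimates for the slope in y = b0 + b1*d + e, d in {0, 1}.

    Assumes each group has at least two observations.
    """
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    n0, n1 = len(y0), len(y1)
    n = n0 + n1

    def sum_sq_dev(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    s0, s1 = sum_sq_dev(y0), sum_sq_dev(y1)  # S_0^2 and S_1^2 in the text
    hc0 = s0 / n0 ** 2 + s1 / n1 ** 2
    return {
        "Conventional": n / (n0 * n1) * (s0 + s1) / (n - 2),
        "HC0": hc0,
        "HC1": n / (n - 2) * hc0,
        "HC2": s0 / (n0 * (n0 - 1)) + s1 / (n1 * (n1 - 1)),
        "HC3": s0 / (n0 - 1) ** 2 + s1 / (n1 - 1) ** 2,
    }


# Balanced design (r = 0.5): the conventional estimate coincides with HC1.
balanced = dummy_regression_variances([1.0, 2.0, 4.0, 3.0, 7.0, 8.0],
                                      [0, 0, 0, 1, 1, 1])
```

In any design the ordering HC0 < HC2 < HC3 holds mechanically, because the denominators $N_j^2$, $N_j(N_j - 1)$, and $(N_j - 1)^2$ shrink in that order.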

A small Monte Carlo study based on (8.1.9) illustrates the 
pluses and minuses of alternative estimators and the extent to 
which a simple rule of thumb goes a long way toward amelio- 
rating the bias of the HC class. We choose N = 30 to highlight 
small sample issues, and r = 0.10 (10 percent treated), which 
implies $h_{ii} = 1/N_1 = 1/3$ if $D_i = 1$ and $h_{ii} = 1/N_0 = 1/27$ if $D_i = 0$. This is a highly
unbalanced design. We draw residuals from the distributions

$$\varepsilon_i \sim \begin{cases} N(0, \sigma^2) & \text{if } D_i = 0 \\ N(0, 1) & \text{if } D_i = 1 \end{cases}$$

and report results for three cases. The first has lots of
heteroskedasticity, with $\sigma = 0.5$, while the second has relatively
little heteroskedasticity, with $\sigma = 0.85$. No heteroskedasticity
($\sigma = 1$) is the benchmark case.

6In this simple example, HC2 is unbiased whether or not residuals are
homoskedastic.

Table 8.1.1 displays the results. Columns 1 and 2 report 
means and standard deviations of the various standard error 
estimates across 25,000 replications of the sampling experiment.
The standard deviation of $\hat\beta_1$ across replications is the
sampling variability we are trying to measure. With lots of
heteroskedasticity, as in the upper panel of the table, conventional
standard errors are badly biased and, on average, only about half
the size of the Monte Carlo standard deviation that constitutes our
target. On the other hand, while the robust standard errors perform
better, except for HC3, they are still too small.7

The standard errors are themselves estimates and have con- 
siderable sampling variability. Especially noteworthy is the 
fact that the robust standard errors have much higher sampling
variability than the conventional standard errors, as can
be seen in column 2.8 The sampling variability of estimated
standard errors further increases when we attempt to reduce
bias by dividing the squared residuals by $1 - h_{ii}$ (HC2) or $(1 - h_{ii})^2$
(HC3). The worst case is HC3, with a standard deviation about
50 percent above the standard deviation of the White (1980a)
standard error, HC0.

The last two columns in the table show empirical rejection 
rates in a nominal 5 percent test for the hypothesis $\beta_1 = 0$,
the population parameter in this case. The test statistics are
compared with a normal distribution and with a t-distribution
with N — 2 degrees of freedom. Rejection rates are far too high 
for all tests, even with HC3. Using a t-distribution rather than 
a normal distribution helps only marginally. 
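
A scaled-down version of this sampling experiment can be sketched as follows. It is not the authors' code: the function names are hypothetical, the default of 2,000 replications is far fewer than the 25,000 used for the table, rejection rates use only the normal critical value 1.96, and only one of the max rules is included. Results should therefore match the table only up to simulation noise.

```python
import math
import random
import statistics


def two_group_variances(y0, y1):
    """Difference in means and the five variance estimates for two groups."""
    n0, n1 = len(y0), len(y1)
    n = n0 + n1
    m0, m1 = sum(y0) / n0, sum(y1) / n1
    s0 = sum((v - m0) ** 2 for v in y0)
    s1 = sum((v - m1) ** 2 for v in y1)
    hc0 = s0 / n0 ** 2 + s1 / n1 ** 2
    return m1 - m0, {
        "Conventional": n / (n0 * n1) * (s0 + s1) / (n - 2),
        "HC0": hc0,
        "HC1": n / (n - 2) * hc0,
        "HC2": s0 / (n0 * (n0 - 1)) + s1 / (n1 * (n1 - 1)),
        "HC3": s0 / (n0 - 1) ** 2 + s1 / (n1 - 1) ** 2,
    }


def monte_carlo(sigma=0.5, n0=27, n1=3, reps=2000, seed=0):
    """Simulate (8.1.9) with beta_1 = 0 and summarize each standard error."""
    rng = random.Random(seed)
    betas = []
    ses = {}
    for _ in range(reps):
        y0 = [rng.gauss(0.0, sigma) for _ in range(n0)]  # D = 0: N(0, sigma^2)
        y1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]    # D = 1: N(0, 1)
        b1, var = two_group_variances(y0, y1)
        betas.append(b1)
        se = {name: math.sqrt(v) for name, v in var.items()}
        se["max(HC2, Conventional)"] = max(se["HC2"], se["Conventional"])
        for name, value in se.items():
            ses.setdefault(name, []).append(value)
    out = {"sd(beta1_hat)": statistics.pstdev(betas)}
    for name, values in ses.items():
        reject = sum(abs(b) / s > 1.96 for b, s in zip(betas, values)) / reps
        # (mean of SE, standard deviation of SE, rejection rate at 1.96)
        out[name] = (statistics.mean(values), statistics.stdev(values), reject)
    return out
```

Calling `monte_carlo(sigma=0.5)`, `monte_carlo(sigma=0.85)`, and `monte_carlo(sigma=1.0)` corresponds to the three heteroskedasticity cases described above.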


7 Although HC2 is an unbiased estimator of the sampling variance, the mean
of the HC2 standard errors across sampling experiments (0.52) is still below
the standard deviation of $\hat\beta_1$ (0.59). This comes from the fact that the standard
error is the square root of the sampling variance, the sampling variance is itself
estimated and hence has sampling variability, and the square root is a concave
function.

8The large sampling variance of robust standard error estimators is noted 
by Chesher and Austin (1991). Kauermann and Carroll (2001) propose an 
adjustment to confidence intervals to correct for this.

TABLE 8.1.1
Monte Carlo results for robust standard error estimates

                                                  Empirical 5% Rejection Rates
                                       Standard
                               Mean    Deviation      Normal        t
Parameter Estimate              (1)       (2)          (3)         (4)

A. Lots of heteroskedasticity
$\hat\beta_1$                 -.001      .586
Standard Errors
  Conventional                 .331      .052         .278        .257
  HC0                          .417      .203         .247        .231
  HC1                          .447      .218         .223        .208
  HC2                          .523      .260         .177        .164
  HC3                          .636      .321         .130        .120
  max(HC0, Conventional)       .448      .172         .188        .171
  max(HC1, Conventional)       .473      .190         .173        .157
  max(HC2, Conventional)       .542      .238         .141        .128
  max(HC3, Conventional)       .649      .305         .107        .097

B. Little heteroskedasticity
$\hat\beta_1$                  .004      .600
Standard Errors
  Conventional                 .520      .070         .098        .084
  HC0                          .441      .193         .217        .202
  HC1                          .473      .207         .194        .179
  HC2                          .546      .250         .156        .143
  HC3                          .657      .312         .114        .104
  max(HC0, Conventional)       .562      .121         .083        .070
  max(HC1, Conventional)       .578      .138         .078        .067
  max(HC2, Conventional)       .627      .186         .067        .057
  max(HC3, Conventional)       .713      .259         .053        .045

C. No heteroskedasticity
$\hat\beta_1$                 -.003      .611
Standard Errors
  Conventional                 .604      .081         .061        .050
  HC0                          .453      .190         .209        .193
  HC1                          .486      .203         .185        .171
  HC2                          .557      .247         .150        .136
  HC3                          .667      .309         .110        .100
  max(HC0, Conventional)       .629      .109         .055        .045
  max(HC1, Conventional)       .640      .122         .053        .044
  max(HC2, Conventional)       .679      .166         .047        .039
  max(HC3, Conventional)       .754      .237         .039        .031

Notes: The table reports results from a sampling experiment with 25,000
replications. Columns 1 and 2 show the mean and standard deviation of estimated
standard errors, except for the first row in each panel, which shows the mean
and standard deviation of $\hat\beta_1$. The model is as described by (8.1.9),
with $\beta_1 = 0$, $r = .1$, $N = 30$, and
heteroskedasticity as indicated in the panel headings.

The results with little heteroskedasticity, reported in the second
panel, show that conventional standard errors are still too
low; this bias is now on the order of 15 percent. HC0 and HC1
are also too small, about as before in absolute terms, though
they now look worse relative to the conventional standard
errors. The HC2 and HC3 standard errors are still larger than
the conventional standard errors, on average, but empirical
rejection rates are higher for these two than for conventional
standard errors. This means the robust standard errors are
sometimes too small “by accident,” an event that happens
often enough to inflate rejection rates so that they exceed the
conventional rejection rates.

One lesson we can take away from this is that robust 
standard errors are no panacea. They can be smaller than con- 
ventional standard errors for two reasons: the small sample 
bias we have discussed and their higher sampling variance. 
We therefore take empirical results where the robust standard 
errors fall below the conventional standard errors as a red flag. 
This is very likely due to bias or a chance occurrence that is bet- 
ter discounted. In this spirit, the maximum of the conventional 
standard error and a robust standard error may be the best 
measure of precision. This rule of thumb helps on two counts: 
it truncates low values of the robust estimators, reducing 
bias, and it reduces variability. Table 8.1.1 shows the empirical
rejection rates obtained using max(HC$_j$, Conventional).
Rejection rates using this rule of thumb look pretty good
in panel B and are considerably better than the rates using
robust estimators alone, even with lots of heteroskedasticity,
as shown in panel A.9

Since there is no gain without pain, there must be some cost
to using max(HC$_j$, Conventional). The cost is that the best
standard error when there is no heteroskedasticity is the
conventional estimate. This is documented in the bottom panel of
the table. Use of the maximum inflates standard errors
unnecessarily under homoskedasticity, depressing rejection rates.
Nevertheless, the table shows that even in this case, rejection
rates don’t go down all that much. We also view an
underestimate of precision as being less costly than an overestimate.
Underestimating precision, we come away thinking the data
are not very informative and that we should try to collect more
or improve the research design, while in the latter case we may
mistakenly draw important substantive conclusions.

9Yang, Hsu, and Zhao (2005) formalize the notion of test procedures
based on the maximum of a set of test statistics with differing efficiency and
robustness properties.

A final comment on this Monte Carlo investigation con- 
cerns the small sample size. Labor economists like us are used 
to working with tens of thousands of observations or more. 
But sometimes we don’t. In a study of the effects of busing on 
public school students, Angrist and Lang (2004) worked with 
samples of about 3,000 students grouped in 56 schools. The 
regressor of interest in this study varied within grade only at 
the school level, so some of the analysis uses 56 school means. 
Not surprisingly, therefore, Angrist and Lang (2004) obtained 
HC1 standard errors below conventional OLS standard errors 
when working with school-level data. As a rule, even if you 
start with the microdata on individuals, when the regressor 
of interest varies at a higher level of aggregation—a school, 
state, or some other group or cluster—effective sample sizes 
are much closer to the number of clusters than to the num- 
ber of individuals. Inference procedures for clustered data are 
discussed in detail in the next section. 


8.2 Clustering and Serial Correlation in Panels 


8.2.1 Clustering and the Moulton Factor 


Heteroskedasticity rarely leads to dramatic changes in infer- 
ence. In large samples where bias is not likely to be a problem, 
we might see standard errors increase by about 25 percent 
when moving from the conventional to the HC1 estimator. In 
contrast, clustering can make all the difference. 

The clustering problem can be illustrated using a simple 
bivariate model estimated in data with a group structure. 
Suppose we’re interested in the bivariate regression,

$$Y_{ig} = \beta_0 + \beta_1 x_g + e_{ig}, \tag{8.2.1}$$

where $Y_{ig}$ is the dependent variable for individual $i$ in cluster
or group $g$, with $G$ groups. Importantly, the regressor of
interest, $x_g$, varies only at the group level. For example, data from
the STAR experiment analyzed by Krueger (1999) come in the
form of $Y_{ig}$, the test score of student $i$ in class $g$, and class
size, $x_g$.

Although students were randomly assigned to classes in the 
STAR experiment, the STAR data are unlikely to be inde- 
pendent across observations. The test scores of students in 
the same class tend to be correlated because students in the 
same class share background characteristics and are exposed 
to the same teacher and classroom environment. It’s therefore 
prudent to assume that, for students $i$ and $j$ in the same class, $g$,

$$E[e_{ig} e_{jg}] = \rho_e \sigma_e^2 > 0, \tag{8.2.2}$$

where $\rho_e$ is the residual intraclass correlation coefficient and
$\sigma_e^2$ is the residual variance.

Correlation within groups is often modeled using an additive
random effects model. Specifically, we assume that the
residual, $e_{ig}$, has a group structure,

$$e_{ig} = v_g + \eta_{ig}, \tag{8.2.3}$$

where $v_g$ is a random component specific to class $g$ and $\eta_{ig}$ is a
mean-zero student-level error component that’s left over. We
focus here on the correlation problem, so both of these error
components are assumed to be homoskedastic. The group-level
error component is assumed to capture all within-group
correlation, so the $\eta_{ig}$ are uncorrelated.10

When the regressor of interest varies only at the group level, 
an error structure like (8.2.3) can increase standard errors 
sharply. This unfortunate fact is not news—Kloek (1981) and 


10This sort of residual correlation structure is also a consequence of strat- 
ified sampling (see, e.g., Wooldridge, 2003). Most of the samples that we 
work with are close enough to random that we typically worry more about the 
dependence due to a group structure than clustering due to stratification. Note 
that there is no GLS estimator for equation 8.2.1 with error structure 8.2.3 
because the regressor is fixed within groups. In any case, here as elsewhere we 
prefer a “fix-the-standard-errors” approach to GLS. 




Moulton (1986) both made the point—but it seems fair to 
say that clustering didn’t really become part of the applied 
econometrics zeitgeist until about 15 years ago. 

Given the error structure, (8.2.3), the intraclass correlation
coefficient becomes

$$\rho_e = \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2},$$

where $\sigma_v^2$ is the variance of $v_g$ and $\sigma_\eta^2$ is the variance of $\eta_{ig}$.
A word on terminology: $\rho_e$ is called the intraclass correlation
coefficient even when the groups of interest are not
classrooms.

Let $V_c(\hat\beta_1)$ be the conventional OLS variance formula for the
regression slope (a diagonal element of $\Omega_c$ in the previous
section), while $V(\hat\beta_1)$ denotes the correct sampling variance given
the error structure, (8.2.3). With nonstochastic regressors
fixed at the group level and groups of equal size, $n$, we have

$$\frac{V(\hat\beta_1)}{V_c(\hat\beta_1)} = 1 + (n - 1)\rho_e, \tag{8.2.4}$$

a formula derived in the appendix to this chapter. We call the
square root of this ratio the Moulton factor, after Moulton’s
(1986) influential study. Equation (8.2.4) tells us how much
we overestimate precision by ignoring intraclass correlation.
Conventional standard errors become increasingly misleading
as $n$ and $\rho_e$ increase. Suppose, for example, that $\rho_e = 1$. In
this case, all the errors within a group are the same, so the
$Y_{ig}$ values are the same as well. Making a data set larger by
copying a smaller one $n$ times generates no new information.
The variance $V(\hat\beta_1)$ should therefore be scaled up from $V_c(\hat\beta_1)$
by a factor of $n$. The Moulton factor increases with group size
because with a fixed overall sample size, larger groups mean
fewer clusters, in which case there is less independent
information in the sample (because the data are independent across
clusters but not within).11


11With nonstochastic regressors and homoskedastic residuals, the Moulton
factor is a finite-sample result. Survey statisticians call the Moulton factor the

Even small intraclass correlation coefficients can generate a
big Moulton factor. In Angrist and Lavy (2008), for example,
4,000 students are grouped in 40 schools, so the average $n$ is
100. The regressor of interest is school-level treatment status:
all students in treated schools were eligible to receive
cash awards for passing their matriculation exams. The intraclass
correlation in this study fluctuates around .1. Applying
formula (8.2.4), the variance ratio is $1 + 99 \times .1 = 10.9$, so the
Moulton factor is over 3 and the standard errors reported by
default are only one-third what they should be.
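
The arithmetic behind this example is easy to check. The helper below is a hypothetical name for a direct transcription of (8.2.4), which assumes equal group sizes and a regressor fixed within groups.

```python
import math


def moulton_factor(n, rho_e):
    """Square root of (8.2.4): correct over conventional standard error."""
    return math.sqrt(1.0 + (n - 1) * rho_e)


# 40 schools of about 100 students each, intraclass correlation around .1.
factor = moulton_factor(100, 0.1)  # about 3.3
```

The two limiting cases also come out as the text describes: with `rho_e = 0` the factor is 1, and with `rho_e = 1` it is the square root of the group size.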

Equation (8.2.4) covers an important special case where the
regressors are fixed within groups and group size is constant.
The general formula allows the regressor, $x_{ig}$, to vary at the
individual level and for different group sizes, $n_g$. In this case,
the Moulton factor is the square root of

$$\frac{V(\hat\beta_1)}{V_c(\hat\beta_1)} = 1 + \left[\frac{V(n_g)}{\bar n} + \bar n - 1\right]\rho_x \rho_e, \tag{8.2.5}$$

where $\bar n$ is the average group size, and $\rho_x$ is the intraclass
correlation of $x_{ig}$:

$$\rho_x = \frac{\sum_g \sum_j \sum_{i \neq j} (x_{ig} - \bar x)(x_{jg} - \bar x)}{V(x_{ig}) \sum_g n_g (n_g - 1)}.$$

Note that $\rho_x$ does not impose a variance components structure
like (8.2.3); here, $\rho_x$ is a generic measure of the correlation of
regressors within groups. The general Moulton formula tells
us that clustering has a bigger impact on standard errors with
variable group sizes and when $\rho_x$ is large. The impact vanishes
when $\rho_x = 0$. In other words, if the $x_{ig}$ values are uncorrelated
within groups, the grouped error structure does not matter for
standard errors. That’s why we worry most about clustering
when the regressor of interest is fixed within groups.


design effect because it tells us how much to adjust standard errors in stratified
samples for deviations from simple random sampling (Kish, 1965).

We illustrate formula (8.2.5) using the Tennessee STAR
example. A regression of kindergartners’ percentile score on
class size yields an estimate of -.62 with a robust (HC1) standard
error of .09. In this case, $\rho_x = 1$ because class size is
fixed within classes, while $V(n_g)$ is positive because classes
vary in size (in this case, $V(n_g) = 17.1$). The intraclass correlation
coefficient for residuals is .31 and the average class size
is 19.4. Plugging these numbers into (8.2.5) gives a value of
about 7 for $V(\hat\beta_1)/V_c(\hat\beta_1)$, so that conventional standard errors should
be multiplied by a factor of $2.65 = \sqrt{7}$. The corrected standard
error is therefore about 0.24.
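
The STAR calculation can be reproduced from the numbers quoted in the paragraph above. The sketch below also includes a helper for $\rho_x$; both function names are hypothetical, and the use of the population variance (dividing by the number of observations) for $V(x_{ig})$ is an assumption.

```python
import math


def intraclass_corr_x(groups):
    """rho_x for a list of groups, each a list of regressor values."""
    all_x = [x for g in groups for x in g]
    xbar = sum(all_x) / len(all_x)
    # Assumption: V(x_ig) is the population variance (divide by N, not N - 1).
    var_x = sum((x - xbar) ** 2 for x in all_x) / len(all_x)
    numerator = 0.0
    pairs = 0
    for g in groups:
        dev = [x - xbar for x in g]
        # Sum over i != j of dev_i * dev_j equals (sum dev)^2 - sum dev^2.
        numerator += sum(dev) ** 2 - sum(d * d for d in dev)
        pairs += len(g) * (len(g) - 1)
    return numerator / (var_x * pairs)


def moulton_ratio(var_ng, nbar, rho_x, rho_e):
    """Right-hand side of (8.2.5): the variance ratio V / V_c."""
    return 1.0 + (var_ng / nbar + nbar - 1.0) * rho_x * rho_e


# STAR numbers quoted in the text.
ratio = moulton_ratio(var_ng=17.1, nbar=19.4, rho_x=1.0, rho_e=0.31)
corrected_se = 0.09 * math.sqrt(ratio)
```

With these inputs the ratio is just under 7 and the corrected standard error is about 0.24, in line with the text.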

The Moulton factor works similarly with 2SLS estimates. In 
particular, we can use (8.2.5), replacing $\rho_x$ with $\rho_{\hat x}$, where $\rho_{\hat x}$
is the intraclass correlation coefficient of the first-stage fitted
values and $\rho_e$ is the intraclass correlation of the second-stage
residuals (Shore-Sheppard, 1996). To understand why this 
works, recall that conventional standard errors for 2SLS are 
derived from the residual variance of the second-stage equa- 
tion divided by the variance of the first-stage fitted values. 
This is the same asymptotic variance formula as for OLS, with 
first-stage fitted values playing the role of the regressor. 

To conclude, we list and compare solutions to the Moul- 
ton problem, starting with the parametric approach described 
above. 


1. Parametric: Fix conventional standard errors using (8.2.5).
The intraclass correlations $\rho_e$ and $\rho_x$ are easy to compute
and supplied as descriptive statistics in some software
packages.12

2. Cluster standard errors: Liang and Zeger (1986) generalize
the White (1980a) robust covariance matrix to allow
for clustering as well as heteroskedasticity. The clustered
covariance matrix is

$$\hat\Omega_{cl} = (X'X)^{-1}\left(\sum_g X_g' \hat\Psi_g X_g\right)(X'X)^{-1}, \tag{8.2.6}$$

where

$$\hat\Psi_g = a\,\hat e_g \hat e_g' = a
\begin{bmatrix}
\hat e_{1g}^2 & \hat e_{1g}\hat e_{2g} & \cdots & \hat e_{1g}\hat e_{n_g g} \\
\hat e_{1g}\hat e_{2g} & \hat e_{2g}^2 & \cdots & \hat e_{2g}\hat e_{n_g g} \\
\vdots & \vdots & \ddots & \vdots \\
\hat e_{1g}\hat e_{n_g g} & \hat e_{2g}\hat e_{n_g g} & \cdots & \hat e_{n_g g}^2
\end{bmatrix}.$$

Here, $X_g$ is the matrix of regressors for group $g$ and $a$ is a
degrees of freedom adjustment factor similar to that which
appears in HC1. The clustered estimator is consistent as the
number of groups gets large given any within-group correlation
structure and not just the parametric model in (8.2.3).
$\hat\Omega_{cl}$ is not consistent with a fixed number of groups,
however, even when the group size tends to infinity. Consistency
is determined by the law of large numbers, which says that
we can rely on sample moments to converge to population
moments (section 3.1.3). But here the sums are at the group
level and not over individuals. Clustered standard errors are
therefore unlikely to be reliable with few clusters, a point
we return to below.

12Use Stata’s loneway command, for example.

3. Use group averages instead of microdata: let $\bar Y_g$ be the mean
of $Y_{ig}$ in group $g$. Estimate

$$\bar Y_g = \beta_0 + \beta_1 x_g + \bar e_g$$

by WLS using the group size as weights. This is equivalent
to OLS using micro data but the grouped-equation standard
errors reflect the group structure, (8.2.3).13 Again,
the asymptotics here are based on the number of groups 
and not the group size. Importantly, however, because the 
group means are close to normally distributed with modest 
group sizes, we can expect the good finite-sample properties 
of regression with normal errors to kick in. The standard 
errors that come out of grouped estimation are therefore 
likely to be more reliable than clustered standard errors in 
samples with few clusters. 


13The grouped residuals are heteroskedastic unless group sizes are equal 
but this is less important than the fact that the error has a group structure in 
the microdata.

Grouped-data estimation can be generalized to models
with microcovariates using a two-step procedure. Suppose
the equation of interest is

$$Y_{ig} = \beta_0 + \beta_1 x_g + \beta_2 w_{ig} + e_{ig}, \tag{8.2.7}$$

where $w_{ig}$ is a covariate that varies within groups. In
step 1, construct the covariate-adjusted group effects, $\mu_g$,
by estimating

$$Y_{ig} = \mu_g + \beta_2 w_{ig} + \eta_{ig}.$$

The $\mu_g$, called group effects, are coefficients on a full set of
group dummies. The estimated $\hat\mu_g$ are group means adjusted
for differences in the individual level variable, $w_{ig}$. Note
that, by virtue of (8.2.7) and (8.2.3), $\mu_g = \beta_0 + \beta_1 x_g + v_g$.
In step 2, therefore, we regress the estimated group effects
on group-level variables:

$$\hat\mu_g = \beta_0 + \beta_1 x_g + \{v_g + (\hat\mu_g - \mu_g)\}. \tag{8.2.8}$$

The efficient GLS estimator for (8.2.8) is WLS, using the
reciprocal of the estimated variance of the group-level residual,
$\{v_g + (\hat\mu_g - \mu_g)\}$, as weights. This can be a problem,
since the variance of $v_g$ is not estimated very well with few
groups. We might therefore weight by the reciprocal of the
variance of the estimated group effects, the group size, or use
no weights at all.14 In an effort to better approximate the
relevant finite-sample distribution, Donald and Lang (2007)
suggest that inference for grouped equations like (8.2.8) be
based on a t-distribution with $G - K$ degrees of freedom.

Note that the grouping approach does not work when
$x_{ig}$ varies within groups. Averaging $x_{ig}$ to $\bar x_g$ is a version
of IV, as we saw in chapter 4. So with micro-variation
in the regressor of interest, grouped estimation identifies
parameters that differ from the target parameters in a model
like (8.2.7).

14See Angrist and Lavy (2008) for an example of the latter two weighting
schemes.

4. Block bootstrap: In general, bootstrap inference uses the 
empirical distribution of the data by resampling. But simple 
random resampling won’t do in this case. The trick with 
clustered data is to preserve the dependence structure in the 
target population. We can do this by block bootstrapping, 
that is, drawing blocks of data defined by the groups g. 
In the Tennessee STAR data, for example, we’d block 
bootstrap by resampling entire classes instead of individual 
students. 

5. In some cases, you may be able to estimate a GLS or 
maximum likelihood model based on a version of (8.2.1) 
combined with a model for the error structure like (8.2.3). 
This fixes the clustering problem but also changes the esti- 
mand unless the CEF is linear, as detailed in section 3.4.1 
for LDV models. We therefore prefer other approaches. 
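
The clustered estimator in item 2 can be sketched for the bivariate case. The code below is an illustration, not the authors' implementation: it returns only the slope element of the clustered covariance matrix for a regression with an intercept, the function name is hypothetical, and the adjustment factor `a` is left as a user-supplied parameter defaulting to 1.

```python
def clustered_slope_variance(x, y, cluster, a=1.0):
    """Slope element of the clustered covariance matrix, bivariate OLS.

    `cluster` gives a group label for each observation; `a` is the degrees of
    freedom adjustment factor (left at 1.0 here as a simplifying assumption).
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    # Sum (x - xbar) * residual within each cluster before squaring; this is
    # what lets within-cluster residual correlation enter the variance.
    scores = {}
    for xi, yi, g in zip(x, y, cluster):
        scores[g] = scores.get(g, 0.0) + (xi - xbar) * (yi - b0 - b1 * xi)
    return a * sum(s * s for s in scores.values()) / sxx ** 2


# Copying each observation three times adds no information when the copies
# are treated as one cluster, but appears to triple the sample otherwise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.5, 3.8, 5.1, 6.4]
x3 = [xi for xi in x for _ in range(3)]
y3 = [yi for yi in y for _ in range(3)]
ids = [i for i in range(6) for _ in range(3)]
v_original = clustered_slope_variance(x, y, list(range(6)))
v_clustered = clustered_slope_variance(x3, y3, ids)
v_ignoring = clustered_slope_variance(x3, y3, list(range(18)))
```

When every observation is its own cluster the function reduces to HC0. In the copied data set, clustering on the original observation recovers the original variance, while ignoring the grouping understates it by a factor of three.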


Table 8.2.1 compares standard-error fixups in the STAR 
example. The table reports six estimates: conventional robust 
standard errors (using HC1); two versions of corrected
standard errors using the Moulton formula (8.2.5), the first using 
the formula for the intraclass correlation given by Moulton 
and the second using Stata’s estimator from the loneway com- 
mand; clustered standard errors; block-bootstrapped standard 
errors; and standard errors from weighted estimation at the 
group level. The coefficient estimate is -.62. In this case, all 
cluster adjustments deliver similar results, a standard error of 
about .23. This happy outcome is due in large part to the fact 
that with 318 classrooms, we have enough clusters for group- 
level asymptotics to work well. With few clusters, however, 
things are much dicier, a point we return to at the end of the 
chapter. 
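
Two of the fixups compared here, the block bootstrap and estimation with group means, can be sketched as follows. The function names and the data layout (a list of pairs holding a group-level regressor and that group's outcomes) are assumptions made for illustration, and the code does not reproduce the STAR estimates.

```python
import random
import statistics


def ols_slope(x, y, w=None):
    """Bivariate (weighted) least squares slope; equal weights by default."""
    w = w or [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    return sxy / sxx


def group_means_slope(groups):
    """WLS of group mean outcomes on x_g, weighting by group size."""
    xs = [xg for xg, _ in groups]
    ybars = [sum(ys) / len(ys) for _, ys in groups]
    sizes = [float(len(ys)) for _, ys in groups]
    return ols_slope(xs, ybars, sizes)


def block_bootstrap_se(groups, reps=500, seed=0):
    """Standard error of the micro-data slope, resampling whole groups."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        # Draw G groups with replacement; all members of a drawn group move
        # together, which preserves the within-group dependence.
        draw = [rng.choice(groups) for _ in groups]
        xs = [xg for xg, ys_g in draw for _ in ys_g]
        ys = [yi for _, ys_g in draw for yi in ys_g]
        slopes.append(ols_slope(xs, ys))
    return statistics.stdev(slopes)
```

When the regressor is fixed within groups, `group_means_slope` returns the same coefficient as OLS on the microdata, which is the equivalence noted in item 3 above.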


TABLE 8.2.1
Standard errors for class size effects in the STAR data (318 clusters)

Variance Estimator                                      Std. Err.
Robust (HC1)                                              .090
Parametric Moulton correction                             .222
  (using Moulton intraclass correlation)
Parametric Moulton correction                             .230
  (using Stata intraclass correlation)
Clustered                                                 .232
Block bootstrap                                           .231
Estimation using group means                              .226
  (weighted by class size)

Notes: The table reports standard errors for the estimates
from a regression of kindergartners’ average percentile scores
on class size using the public use data set from Project STAR.
The coefficient on class size is -.62. The group level for
clustering is the classroom. The number of observations is 5,743.
The bootstrap estimate uses 1,000 replications.


8.2.2 Serial Correlation in Panels
and Difference-in-Difference Models


Serial correlation, the tendency for one observation to be
correlated with those that have gone before, used to be Somebody
Else’s Problem, specifically, a problem for the unfortunate souls
who make their living out of time series data (macroeconomists, for
example). Applied microeconometricians have therefore long
ignored it.15 But our data often have a time dimension, too,
especially in DD models. This fact combined with clustering
can have a major impact on statistical inference.

Suppose, as in section 5.2, that we are interested in the
effects of a state minimum wage. In this context, the regression
version of DD includes additive state and time effects. We
therefore get an equation like (5.2.2), repeated below:

$$Y_{ist} = \gamma_s + \lambda_t + \delta D_{st} + \varepsilon_{ist}. \tag{8.2.9}$$

As before, $Y_{ist}$ is the outcome for individual $i$ in state $s$ in year
$t$ and $D_{st}$ is a dummy variable that indicates treatment states
in posttreatment periods.

15The Somebody Else’s Problem (SEP) field, first identified as a natural
phenomenon in Adams’s Life, the Universe, and Everything, is, according to
Wikipedia, “a generated energy field that affects perception. . . . Entities within
the field will be perceived by an outside observer as ‘Somebody Else’s Problem,’
and will therefore be effectively invisible unless the observer is specifically
looking for the entity.”

The error term in (8.2.9) reflects the idiosyncratic variation
in potential outcomes across people, states, and time. Some
of this variation is likely to be common to individuals in the
same state and year, for example, a regional business cycle. We
can model this common component by thinking of $\varepsilon_{ist}$ as the
sum of a state-year shock, $v_{st}$, and an idiosyncratic individual
component, $\eta_{ist}$. So we have:

$$Y_{ist} = \gamma_s + \lambda_t + \delta D_{st} + v_{st} + \eta_{ist}. \tag{8.2.10}$$

We assume that in repeated draws across states and over time,
$E[v_{st}] = 0$, while $E[\eta_{ist} \mid s, t] = 0$ by definition.

State-year shocks are bad news for DD models. As with 
the Moulton problem, state- and time-specific random effects 
generate a clustering problem that affects statistical inference. 
But that might be the least of our problems in this case. To see 
why, suppose we have only two periods and two states, as in 
the Card and Krueger (1994) New Jersey-Pennsylvania study. 
The empirical DD estimator is 


8CK = (¥s=NJ,t=Nov — Ys=NJ,t=Feb) — (Ys=PA,t=Nov — Y¥s=PA,t=Feb)« 


This estimator is unbiased, since E[v,,] = E[nj] = 0. On the 
other hand, assuming we think of probability limits as increas- 
ing group size while keeping the choice of states and periods 
fixed, state-year shocks render cx inconsistent: 


plim ber 


= ô + {(Us=Ny,t=Nov — Us=N],t=Feb) —(Us=PA,t=Nov — Vs=PA,t=Feb) }- 


Averaging larger and larger samples within New Jersey and 
Pennsylvania in a pair of periods does nothing to eliminate 
the regional shocks specific to a given location and period. 
With only two states and years, we have no way to dis- 
tinguish the differences-in-differences generated by a policy 
change from the differences-in-differences due to the fact that,
say, the New Jersey economy was holding steady in 1992
while Pennsylvania was experiencing a cyclical downturn. The
presence of $v_{st}$ amounts to a failure of the common trends
assumption discussed in section 5.2.
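
The point that bigger samples do not rescue the two-state, two-period design can be illustrated with a small simulation. This is only a sketch under assumed values: the shock standard deviation `sigma_v`, the effect `delta`, the replication count, and the function name `simulate_dd` are all hypothetical choices, not taken from any study. State and time effects are left out because they difference away, and cell means of the individual errors are drawn directly from their exact normal distribution rather than by averaging individual draws.

```python
import numpy as np

def simulate_dd(n_per_cell, sigma_v=0.5, delta=1.0, reps=5000, seed=0):
    """Return DD estimates from repeated two-state, two-period samples."""
    rng = np.random.default_rng(seed)
    # Cell order: (NJ, Feb), (NJ, Nov), (PA, Feb), (PA, Nov).
    treated = np.array([0.0, 1.0, 0.0, 0.0])  # D_st = 1 only for NJ in Nov
    # One state-year shock v_st per cell and replication.
    v = rng.normal(0.0, sigma_v, size=(reps, 4))
    # The mean of n iid N(0, 1) individual errors is N(0, 1/n); draw it directly.
    eta_bar = rng.normal(0.0, 1.0 / np.sqrt(n_per_cell), size=(reps, 4))
    ybar = delta * treated + v + eta_bar
    return (ybar[:, 1] - ybar[:, 0]) - (ybar[:, 3] - ybar[:, 2])

# Spread of the DD estimate as the number of individuals per cell grows.
sds = {n: simulate_dd(n).std() for n in (10, 1_000, 100_000)}
for n, sd in sds.items():
    print(f"n per cell = {n:>7}: sd of DD estimate = {sd:.3f}")
```

Under these assumptions the sampling variance is $4\sigma_v^2 + 4/n$, so the standard deviation settles near $2\sigma_v = 1$ instead of shrinking to zero as the cells fill up.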

The solution to the inconsistency induced by random shocks
in differences-in-differences models is to analyze samples
including multiple time periods or many states (or both).
For example, Card (1992) uses 51 states to study minimum 
wage changes, while Card and Krueger (2000) take another 
look at the New Jersey-Pennsylvania experiment with a longer 
monthly time series of payroll data. With multiple states or
periods, we can hope that the $v_{st}$ average out to zero. As in the
first part of this chapter on the Moulton problem, the inference
framework in this context relies on asymptotic distribution
theory with many groups and not on group size (or, at least,
not on group size alone). The most important inference issue
then becomes the behavior of $v_{st}$. In particular, if we are
prepared to assume that shocks are independent across states and
over time (that is, that they are serially uncorrelated), we are
back to the plain vanilla Moulton problem in section 8.2.1, in
which case clustering standard errors by state × year should
generate valid inferences. But in most cases, the assumption
that $v_{st}$ is serially uncorrelated is hard to defend. Almost
certainly, for example, regional shocks are highly serially
correlated: if things are bad in Pennsylvania in one month, they
are likely to be about as bad in the next.

The consequences of serial correlation for clustered panels 
are highlighted by Bertrand, Duflo, and Mullainathan (2004) 
and Kézdi (2004). Any research design with a group structure 
where the group means are correlated can be said to have the 
serial correlation problem. The upshot of recent research on 
serial correlation in data with a group structure is that, just as 
we must adjust our standard errors for the correlation within 
groups induced by the presence of $v_{st}$, we must further adjust
for serial correlation in the $v_{st}$ themselves. There are a number
of ways to do this, not all equally effective in all situations. It 
seems fair to say that the question of how best to approach 
the serial correlation problem is currently under study, and a 
consensus has not yet emerged. 
The simplest and most widely applied approach is to pass the
clustering buck one level higher. In the state-year example, we
can report Liang and Zeger (1986) standard errors clustered by
state instead of by state and year (e.g., using Stata cluster).
This might seem odd at first blush, since the model controls
for state effects. The state effect, $\gamma_s$, in (8.2.10) removes the
state mean of $v_{st}$, which we denote by $\bar v_s$. Nevertheless, $v_{st} - \bar v_s$
is probably still serially correlated. Clustering standard errors
at the state level takes account of this, since the one-level-up
clustered covariance estimator allows for unrestricted residual
correlation within clusters, including the time series
correlation in $v_{st} - \bar v_s$. This is a quick and easy fix.¹⁶ The problem
here is that passing the buck up one level reduces the number
of clusters. And asymptotic inference supposes we have a large
number of clusters because we need many states or periods to
estimate the correlation between $v_{st} - \bar v_s$ and $v_{s,t-1} - \bar v_s$
reasonably well. A paucity of clusters can lead to biased standard
errors and misleading inferences.
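
To see what passing the buck one level up does in practice, here is a rough sketch that computes a Liang and Zeger style sandwich by hand on simulated data. Everything about the data is an assumption made for illustration: 50 states, 10 years, 20 individuals per cell, AR(1) state-year shocks with coefficient 0.8, and a treatment that switches on halfway through for half the states. The helper `cluster_cov` is a hypothetical name, and it omits the finite-sample scaling that packaged routines typically apply.

```python
import numpy as np

def cluster_cov(X, resid, groups):
    """Sandwich (X'X)^-1 [sum_g X_g' e_g e_g' X_g] (X'X)^-1, no small-sample scaling."""
    _, idx = np.unique(groups, return_inverse=True)
    scores = np.zeros((idx.max() + 1, X.shape[1]))
    np.add.at(scores, idx, X * resid[:, None])  # row g accumulates X_g' e_g
    bread = np.linalg.inv(X.T @ X)
    return bread @ (scores.T @ scores) @ bread

rng = np.random.default_rng(0)
S, T, n, rho, delta = 50, 10, 20, 0.8, 1.0  # assumed design

# AR(1) state-year shocks v_st with stationary variance 1.
v = np.zeros((S, T))
v[:, 0] = rng.normal(size=S)
for t in range(1, T):
    v[:, t] = rho * v[:, t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=S)

state = np.repeat(np.arange(S), T * n)
year = np.tile(np.repeat(np.arange(T), n), S)
D = ((state < S // 2) & (year >= T // 2)).astype(float)
y = delta * D + v[state, year] + rng.normal(size=state.size)

# Treatment dummy, constant, state dummies, year dummies (one of each dropped).
X = np.column_stack([
    D,
    np.ones(state.size),
    (state[:, None] == np.arange(1, S)).astype(float),
    (year[:, None] == np.arange(1, T)).astype(float),
])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

se_state_year = np.sqrt(cluster_cov(X, resid, state * T + year)[0, 0])
se_state = np.sqrt(cluster_cov(X, resid, state)[0, 0])
print(f"delta-hat                  = {beta[0]:.3f}")
print(f"SE clustered by state-year = {se_state_year:.3f}")
print(f"SE clustered by state      = {se_state:.3f}")
```

With shocks this persistent, the state-level standard error should come out noticeably larger than the state-year one, which is the serial correlation that state-year clustering ignores.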


8.2.3 Fewer than 42 Clusters 


Bias from few clusters is a risk in both the Moulton and the 
serial correlation contexts because in both cases, inference is 
cluster-based. With few clusters, we tend to underestimate 
either the serial correlation in a random shock like $v_{st}$ or the
intraclass correlation, $\rho_e$, in the Moulton problem. The
relevant dimension for counting clusters in the Moulton problem
is the number of groups, G. In a DD scenario where you’d like 
to cluster on state or some other cross-sectional dimension, 
the relevant dimension for counting clusters is the number of 
states or cross-sectional groups. Therefore, following Douglas 
Adams’s dictum that the ultimate answer to life, the universe, 
and everything is 42, we believe the question is: How many 
clusters are enough for reliable inference using the standard 
cluster adjustment derived from (8.2.6)? 
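
A quick Monte Carlo sketch gives a feel for how badly the cluster adjustment can understate sampling variance when clusters are scarce. The setup is assumed for illustration: a regressor fixed within equal-sized groups, a group shock plus individual noise, and hypothetical function names `cluster_var_slope` and `bias_ratio`. With equal group sizes, micro-data OLS coincides with OLS on group means, so the simulation works with group means directly.

```python
import numpy as np

def cluster_var_slope(x_g, ybar_g):
    """Slope and its cluster-robust variance when x is fixed within equal-sized groups.

    In this special case the clustered sandwich for the slope reduces to
    sum_g d_g^2 ehat_g^2 / (sum_g d_g^2)^2, with d_g the demeaned regressor.
    """
    d = x_g - x_g.mean()
    slope = d @ ybar_g / (d @ d)
    ehat = (ybar_g - ybar_g.mean()) - slope * d
    return slope, (d**2 @ ehat**2) / (d @ d) ** 2

def bias_ratio(G, reps=4000, n=20, sigma_v=1.0, sigma_eta=1.0, seed=0):
    """Average estimated variance divided by the Monte Carlo variance of the slope."""
    rng = np.random.default_rng(seed)
    x_g = rng.normal(size=G)  # group-level regressor, held fixed across replications
    sd_bar = np.sqrt(sigma_v**2 + sigma_eta**2 / n)  # sd of v_g plus mean of eta_ig
    slopes, vhats = np.empty(reps), np.empty(reps)
    for r in range(reps):
        ybar_g = 0.5 * x_g + rng.normal(0.0, sd_bar, size=G)
        slopes[r], vhats[r] = cluster_var_slope(x_g, ybar_g)
    return vhats.mean() / slopes.var()

r6, r50 = bias_ratio(6), bias_ratio(50)
print(f"G =  6: mean estimated variance / true variance = {r6:.2f}")
print(f"G = 50: mean estimated variance / true variance = {r50:.2f}")
```

A ratio below one means the clustered variance estimate is too small on average; the shortfall should be much larger with 6 groups than with 50.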

If 42 is enough for the standard cluster adjustment to be 
reliable, and if less is too few, then what should you do when 


16 Arellano (1987) appears to have been the first to suggest higher-level 
clustering for models with a panel structure. 
the cluster count is low? First-best is to get more clusters by
collecting more data. But sometimes we’re too lazy for that, 
or the number of groups is naturally fixed, so other ideas are 
detailed below. It’s worth noting at the outset that not all of 
these ideas are equally well-suited for the Moulton and serial 
correlation problems. 


1. Bias correction of clustered standard errors: Clustered
standard errors are biased in small samples because
$E(\hat e_g \hat e_g') \neq E(e_g e_g') = \Psi_g$, just as with the residual covariance matrix
in section 8.1. Usually, $E(\hat e_g \hat e_g')$ is too small. One solution
is to inflate residuals in the hopes of reducing bias. Bell and
McCaffrey (2002) suggest a procedure (called bias-reduced
linearization, or BRL) that adjusts residuals by

$$\hat\Psi_g = a\,\tilde e_g \tilde e_g'$$
$$\tilde e_g = A_g \hat e_g$$

where $A_g$ solves

$$A_g' A_g = (I - H_g)^{-1}$$

and

$$H_g = X_g (X'X)^{-1} X_g'$$

and $a$ is a degrees-of-freedom correction.
This is a version of HC2 for the clustered case. BRL
works for the straight-up Moulton problem with few clusters
but for technical reasons cannot be used for the typical
DD serial correlation problem.¹⁷


17 The matrix $A_g$ is not unique; there are many such decompositions. Bell
and McCaffrey (2002) use the symmetric square root of $(I - H_g)^{-1}$, or

$$A_g = R \Lambda^{1/2},$$

where $R$ is the matrix of eigenvectors of $(I - H_g)^{-1}$ and $\Lambda^{1/2}$ is the diagonal
matrix of the square roots of the eigenvalues. One problem with the Bell and
McCaffrey adjustment is that $(I - H_g)$ may not be of full rank, and hence the
inverse may not exist for all designs. This happens, for example, when one of
the regressors is a dummy variable that is one for exactly one of the clusters,
and zero otherwise. This scenario occurs in the panel DD model discussed by
Bertrand et al. (2004), which includes a full set of state dummies and clusters
by state.
2. Recognizing that the fundamental unit of observation is a
cluster and not an individual unit within clusters, Bell and
McCaffrey (2002) and Donald and Lang (2007) suggest that
inference be based on a t-distribution with $G - K$ degrees of
freedom rather than on the standard normal distribution.
For small G, this makes a difference: confidence intervals 
will be wider, thereby avoiding some mistakes. Cameron, 
Gelbach, and Miller (2008) report Monte Carlo examples 
where the combination of a BRL adjustment and the use of 
t-tables works well. 

3. Donald and Lang (2007) argue that estimation using group 
means works well with small G in the Moulton problem, 
and even better when inference is based on a t-distribution
with $G - K$ degrees of freedom. But, as we discussed in section
8.2.1, for grouped estimation the regressor should be
fixed within groups. The level of aggregation is the level 
at which you’d like to cluster, such as schools in Angrist 
and Lavy (2008). For serial correlation, this is the state, but 
state averages cannot be used to estimate a model with a 
full set of state effects. Also, since treatment status varies 
within states, averaging up to the state level averages the 
regressor of interest as well, changing the rules of the game 
in a way we may not like (the estimator becomes IV using 
group dummies as instruments). The group means approach 
is therefore out of bounds for the serial correlation problem. 
Note also that if the grouped residuals are heteroskedastic, 
and you therefore use robust standard errors, you may have 
to worry about bias of the form discussed in section 8.1. 
In some cases, heteroskedasticity in the grouped residuals 
can be fixed by weighting by the group size. But weight- 
ing changes the estimand when the CEF is nonlinear, so the 
case for weighting is not open and shut (Angrist and Lavy, 
1999, chose not to weight school-level averages because the 
variation in their study comes mostly from small schools). 
Weighted or not, a conservative approach when working 
with group-level averages is to use our rule of thumb from 
section 8.1: take the larger of robust and conventional 
standard errors as your measure of precision. 
4. Cameron, Gelbach, and Miller (2008) report that some
forms of a block bootstrap work well with small numbers 
of groups, and that the block bootstrap typically outper- 
forms Stata-clustered standard errors. This appears to be 
true both for the Moulton and serial correlation problems. 
But Cameron, Gelbach, and Miller (2008) focus on rejec- 
tion rates using (pivotal) test statistics, while we like to see 
standard errors. 

5. Parametric corrections: For the Moulton problem, this 
amounts to use of the Moulton factor. With serial cor- 
relation, this means correcting your standard errors for 
first-order serial correlation at the group level. Based on 
our sampling experiments with the Moulton problem and a 
reading of the literature, parametric approaches may work 
well, and better than the nonparametric cluster estimator 
(8.2.6), especially if the parametric model is not too far 
off (see, e.g., Hansen, 2007a, which also proposes a bias 
correction for estimates of serial correlation parameters). 
Unfortunately, however, beyond the greenhouse world of 
controlled Monte Carlo studies, we’re unlikely to know 
whether parametric assumptions are a good fit. 
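
To make the first idea on this list concrete, the following is a minimal sketch of a BRL-style residual adjustment for the straight-up Moulton case. It is an illustration under assumptions, not Bell and McCaffrey's code: the function name `clustered_variances` is hypothetical, the degrees-of-freedom factor `a` is simply set to one, the data are simulated with 8 groups of 20, and $A_g$ is taken to be the symmetric square root $R \Lambda^{-1/2} R'$ built from the eigendecomposition of $I - H_g$.

```python
import numpy as np

def clustered_variances(X, y, groups, a=1.0):
    """Return (unadjusted, BRL-adjusted) clustered covariance matrices for OLS."""
    bread = np.linalg.inv(X.T @ X)
    ehat = y - X @ (bread @ X.T @ y)
    k = X.shape[1]
    meat_raw, meat_brl = np.zeros((k, k)), np.zeros((k, k))
    for g in np.unique(groups):
        rows = groups == g
        Xg, eg = X[rows], ehat[rows]
        Hg = Xg @ bread @ Xg.T
        # Symmetric square root of (I - H_g)^{-1} via the eigenvalues of I - H_g.
        # This breaks down if I - H_g is singular, e.g. with a dummy for one cluster.
        lam, R = np.linalg.eigh(np.eye(int(rows.sum())) - Hg)
        Ag = (R / np.sqrt(lam)) @ R.T
        s_raw, s_brl = Xg.T @ eg, Xg.T @ (Ag @ eg)
        meat_raw += np.outer(s_raw, s_raw)
        meat_brl += a * np.outer(s_brl, s_brl)
    return bread @ meat_raw @ bread, bread @ meat_brl @ bread

# Simulated Moulton design: regressor fixed within groups, group shock plus noise.
rng = np.random.default_rng(0)
G, n = 8, 20
groups = np.repeat(np.arange(G), n)
x = rng.normal(size=G)[groups]
y = 0.5 * x + rng.normal(size=G)[groups] + rng.normal(size=G * n)
X = np.column_stack([np.ones(G * n), x])

V_raw, V_brl = clustered_variances(X, y, groups)
print(f"unadjusted clustered SE of slope = {np.sqrt(V_raw[1, 1]):.3f}")
print(f"BRL-adjusted SE of slope         = {np.sqrt(V_brl[1, 1]):.3f}")
```

In this design the adjustment inflates each cluster's score, so the BRL standard error is the larger of the two. Following the second idea on the list, the resulting t-statistic would then be compared with a t-distribution with $G - K$ degrees of freedom rather than the normal.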


Alas, the bottom line here is not entirely clear, nor is the 
more basic question of when few clusters are fatal for infer- 
ence. The severity of the resulting bias seems to depend on the 
nature of your problem, in particular whether you confront 
straight-up Moulton or serial correlation issues. Aggregation 
to the group level as in Donald and Lang (2007) seems to 
work well in the Moulton case as long as the regressor of 
interest is fixed within groups and there is not too much 
underlying heteroskedasticity. At a minimum, you'd like to 
show that your conclusions are consistent with the inferences 
that arise from an analysis of group averages, since this is 
a conservative and transparent approach. Angrist and Lavy 
(2008) use BRL standard errors to adjust for clustering at 
the school level but validate this approach by showing that 
key results come out the same using covariate-adjusted group 
averages. 
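
The group-averages check is easy to run. The sketch below assumes simulated data with 10 equal-sized groups and a regressor that is fixed within groups; the variable names are illustrative. It collapses the micro data to group means, runs OLS on the $G$ averages, and computes conventional standard errors with a $G - K$ degrees-of-freedom correction.

```python
import numpy as np

rng = np.random.default_rng(1)
G, n = 10, 30  # assumed number of groups and group size
groups = np.repeat(np.arange(G), n)
x_g = rng.normal(size=G)  # regressor fixed within groups
y = 1.0 + 0.5 * x_g[groups] + rng.normal(size=G)[groups] + rng.normal(size=G * n)

# OLS on the micro data.
X = np.column_stack([np.ones(G * n), x_g[groups]])
beta_micro = np.linalg.lstsq(X, y, rcond=None)[0]

# OLS on group means: one observation per group.
ybar_g = np.bincount(groups, weights=y) / np.bincount(groups)
Xg = np.column_stack([np.ones(G), x_g])
beta_group = np.linalg.lstsq(Xg, ybar_g, rcond=None)[0]

# Conventional standard errors from the group-level regression, G - K df.
K = Xg.shape[1]
resid = ybar_g - Xg @ beta_group
s2 = resid @ resid / (G - K)
se_group = np.sqrt(s2 * np.diag(np.linalg.inv(Xg.T @ Xg)))

print("micro-data coefficients:", np.round(beta_micro, 3))
print("group-mean coefficients:", np.round(beta_group, 3))
print("group-mean standard errors:", np.round(se_group, 3))
```

Because the groups here are equal-sized and the regressor does not vary within groups, the two sets of coefficients coincide; only the standard errors change. With unequal group sizes the match would require weighting by group size, which raises the weighting issues discussed above.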
As far as serial correlation goes, most of the evidence sug-
gests that when you are lucky enough to do research on U.S. 
states, giving 51 clusters, you are on reasonably safe ground 
with a naive application of Stata’s cluster command at 
the state level. But you might have to study Canada, which 
offers only 10 clusters in the form of provinces, well below 
42. Hansen (2007b) finds that Liang and Zeger (1986) (Stata- 
clustered) standard errors are reasonably good at correcting 
for serial correlation in panels, even in the Canadian scenario. 
Hansen also recommends use of a t-distribution with $G - K$
degrees of freedom for critical values.

Clustering problems have forced applied microeconometri- 
cians to eat a little humble pie. Proud of working with large 
microdata sets, we like to sneer at macroeconomists toying 
with small time series samples. But he who laughs last laughs 
best: if the regressor of interest varies only at a coarse group 
level, such as over time or across states or countries, then it’s 
the macroeconomists who have had the most realistic mode of 
inference all along. 


8.3 Appendix: Derivation of the Simple 
Moulton Factor 


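Write

$$y_g = \begin{bmatrix} Y_{1g} \\ Y_{2g} \\ \vdots \\ Y_{n_g g} \end{bmatrix}, \qquad
e_g = \begin{bmatrix} e_{1g} \\ e_{2g} \\ \vdots \\ e_{n_g g} \end{bmatrix}$$

and

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix}, \qquad
X = \begin{bmatrix} \iota_1 x_1' \\ \iota_2 x_2' \\ \vdots \\ \iota_G x_G' \end{bmatrix}, \qquad
e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_G \end{bmatrix},$$

where $\iota_g$ is a column vector of $n_g$ ones and $G$ is the number of
groups. Note that

$$E(ee') = \Psi = \begin{bmatrix} \Psi_1 & 0 & \cdots & 0 \\ 0 & \Psi_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \Psi_G \end{bmatrix}$$

$$\Psi_g = \sigma_e^2 \begin{bmatrix} 1 & \rho_e & \cdots & \rho_e \\ \rho_e & 1 & & \vdots \\ \vdots & & \ddots & \rho_e \\ \rho_e & \cdots & \rho_e & 1 \end{bmatrix}
= \sigma_e^2 \left[ (1 - \rho_e) I + \rho_e \iota_g \iota_g' \right],$$

where $\rho_e = \dfrac{\sigma_v^2}{\sigma_v^2 + \sigma_\eta^2}$.
Now

$$X'X = \sum_g n_g x_g x_g'$$

$$X'\Psi X = \sum_g x_g \iota_g' \Psi_g \iota_g x_g'.$$

But

$$x_g \iota_g' \Psi_g \iota_g x_g' = \sigma_e^2 x_g \iota_g' \begin{bmatrix} 1 + (n_g - 1)\rho_e \\ 1 + (n_g - 1)\rho_e \\ \vdots \\ 1 + (n_g - 1)\rho_e \end{bmatrix} x_g'
= \sigma_e^2 n_g [1 + (n_g - 1)\rho_e] x_g x_g'.$$

Let $\tau_g = 1 + (n_g - 1)\rho_e$, so we get

$$x_g \iota_g' \Psi_g \iota_g x_g' = \sigma_e^2 n_g \tau_g x_g x_g'$$

$$X'\Psi X = \sigma_e^2 \sum_g n_g \tau_g x_g x_g'.$$

With this in hand, we can write

$$V(\hat\beta) = (X'X)^{-1} X'\Psi X (X'X)^{-1}
= \sigma_e^2 \left( \sum_g n_g x_g x_g' \right)^{-1} \sum_g n_g \tau_g x_g x_g' \left( \sum_g n_g x_g x_g' \right)^{-1}.$$

We want to compare this with the standard OLS covariance
estimator

$$V_c(\hat\beta) = \sigma_e^2 \left( \sum_g n_g x_g x_g' \right)^{-1}.$$

If the group sizes are equal, $n_g = n$ and $\tau_g = \tau = 1 + (n - 1)\rho_e$,
so that

$$V(\hat\beta) = \sigma_e^2 \tau \left( \sum_g n x_g x_g' \right)^{-1} \sum_g n x_g x_g' \left( \sum_g n x_g x_g' \right)^{-1}
= \sigma_e^2 \tau \left( \sum_g n x_g x_g' \right)^{-1}
= \tau V_c(\hat\beta),$$

which implies (8.2.4).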
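
The algebra above can be checked numerically. The sketch below builds the stacked regressor matrix and the block-diagonal $\Psi$ for an assumed design (12 groups of 20, $\rho_e = 0.1$, $\sigma_e^2 = 2$, all hypothetical values) and confirms that the sandwich covariance equals $\tau$ times the conventional one when group sizes are equal.

```python
import numpy as np

rng = np.random.default_rng(0)
G, n, rho, sigma2 = 12, 20, 0.1, 2.0  # assumed design values
iota = np.ones((n, 1))
x = np.column_stack([np.ones(G), rng.normal(size=G)])  # row g holds x_g'

# Stack X = [iota x_1'; ...; iota x_G'] and build the block-diagonal Psi.
X = np.vstack([iota @ x[g:g + 1] for g in range(G)])
Psi_g = sigma2 * ((1 - rho) * np.eye(n) + rho * (iota @ iota.T))
Psi = np.kron(np.eye(G), Psi_g)

bread = np.linalg.inv(X.T @ X)
V = bread @ X.T @ Psi @ X @ bread  # V(beta-hat)
V_c = sigma2 * bread               # conventional OLS covariance
tau = 1 + (n - 1) * rho            # Moulton factor

print("tau =", round(tau, 3))
print("V equals tau * V_c:", np.allclose(V, tau * V_c))
```

With $n = 20$ and $\rho_e = 0.1$, $\tau = 2.9$, so conventional standard errors are too small by a factor of about $\sqrt{2.9} \approx 1.70$ even with this modest intraclass correlation.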
LAST WORDS


If applied econometrics were easy, theorists would do it.
But it’s not as hard as the dense pages of Econometrica
might lead you to believe. Carefully applied to coherent
causal questions, regression and 2SLS almost always make
sense. Your standard errors probably won’t be quite right,
but they rarely are. Avoid embarrassment by being your own
best skeptic, and especially, DON’T PANIC!
ACRONYMS AND ABBREVIATIONS


TECHNICAL TERMS 


2SLS Two-stage least squares, an instrumental variables 
(IV) estimator. 

ACR Average causal response, the weighted average causal 
response to an ordered treatment. 

ANOVA Analysis of variance, a decomposition of total 
variance into the variance of the conditional expectation 
function (CEF) and the average conditional variance. 

BRL Bias-reduced linearization estimator, a
bias-corrected covariance matrix estimator for clustered 
data. 

CDF Cumulative distribution function, the probability that 
a random variable takes on a value less than or equal to 
a given number. 

CEF Conditional expectation function, the population
average of $Y_i$ with $X_i$ held fixed.

CIA Conditional independence assumption, a core 
assumption that justifies a causal interpretation of 
regression and matching estimators. 

COP Conditional on positive effect, the treatment-control 
difference in means for a non-negative random variable 
looking at positive values only. 

CQF Conditional quantile function, defined for each
quantile $\tau$, the $\tau$-quantile of $Y_i$, holding $X_i$ fixed.

DD Differences-in-differences estimator. In its simplest
form, a comparison of changes over time in treatment 
and control groups. 

GLS Generalized least squares estimator, a regression 
estimator for models with heteroskedasticity and/or 
serial correlation. GLS provides efficiency gains when the
conditional expectation function (CEF) is linear. 

GMM Generalized method of moments, an econometric 
estimation framework in which estimates are chosen to 
minimize a matrix-weighted average of the squared 
difference between sample and population moments. 

HC0-HC3 Heteroskedasticity consistent covariance matrix
estimators discussed by MacKinnon and White (1985). 

ILS Indirect least squares estimator, the ratio of 
reduced-form to first-stage coefficients in an instrumental 
variables (IV) setup. 

ITT Intention to treat effect, the effect of being offered 
treatment. 

IV Instrumental variables estimator or method. 

JIVE Jackknife instrumental variables (IV) estimator. 

LATE Local average treatment effect, the causal effect of 
treatment on compliers. 

LDVs Limited dependent variables, such as dummies, 
counts, and non-negative random variables on the 
left-hand side of regression and related statistical models. 

LIML Limited information maximum likelihood estimator, 
an alternative to two-stage least squares (2SLS) with less 
bias. 

LM Lagrange multiplier test, a statistical test of the 
restrictions imposed by an estimator. 

LPM Linear probability model, a linear regression model 
for a dummy dependent variable. 

MFX Marginal effects. In nonlinear models, the derivative 
of the conditional expectation function (CEF) implied by 
the model with respect to the regressors. 

MMSE Minimum mean squared error, the minimum 
expected squared prediction error, or the minimum of 
the expected square of the difference between an 
estimator and a target. 

OLS Ordinary least squares estimator, the sample analog of 
the population regression vector. 
OVB Omitted variables bias, the relationship between
regression estimates in models with different sets of 
control variables. 

QTE Quantile treatment effect, the causal effect of 
treatment on conditional quantiles of the outcome 
variable for compliers. 

RD Regression discontinuity design, an identification 
strategy in which treatment, the probability of treatment, 
or the average treatment intensity is a known, 
discontinuous function of a covariate. 

SEM Simultaneous equations model, an econometric 
framework in which causal relationships between 
variables are described by several equations. 


SSIV Split-sample instrumental variables estimator, a 
version of the two-sample instrumental variables (TSIV) 
estimator. 

TSIV Two-sample instrumental variables estimator, an 
instrumental variables (IV) estimator that can sometimes 
be constructed from two data sets when either data set 
alone would be inadequate. 

VIV Visual instrumental variables, a plot of reduced form 
against first-stage fitted values in instrumental variables 
models with dummy instruments. 


WLS Weighted least squares, a GLS estimator with a 
diagonal weighting matrix. 


DATA SETS AND VARIABLE NAMES 


AFDC Aid to Families with Dependent Children, an 
American welfare program no longer in effect. 

AFQT Armed Forces Qualification Test, used by the U.S. 
armed forces to gauge recruits’ academic and cognitive 
ability. 

CPS Current Population Survey, a large monthly survey of 
U.S. households, source of the U.S. unemployment rate. 
GED General Educational Development certificate, a
substitute for traditional high school credentials, 
obtained by passing a test. 

IPUMS Integrated public use microdata series, consistently 
coded samples of census records from the United States 
and other countries. 

NHIS National Health Interview Survey, a large American 
survey with many questions related to health. 

NLSY National Longitudinal Survey of Youth, a 
long-running panel survey that started with a high 
school-aged cohort in 1979. 

PSAT Preliminary SAT, qualifies American high school 
sophomores for a National Merit Scholarship. 

PSID Panel Study of Income Dynamics, a panel survey of 
American households begun in 1968. 

QOB Quarter of birth. 


RSN Random sequence numbers, draft lottery numbers 
randomly assigned to dates of birth in the Vietnam-era 
draft lotteries held from 1970 to 1973. 

SDA Service delivery area, one of the 649 sites where Job 
Training Partnership Act (JTPA) services were delivered. 

SSA Social Security Administration, a U.S. government 
agency. 


STUDY NAMES 


HIE Health Insurance Experiment conducted by the RAND 
Corporation, a randomized trial in which participants 
were exposed to health insurance programs with 
different features. 

JTPA Job Training Partnership Act, a large, federally 
funded training program that included a randomized 
evaluation. 

MDVE Minneapolis Domestic Violence Experiment, a 
randomized trial in which police response to a domestic 
disturbance was determined in part by random 
assignment. 
NSW National Supported Work demonstration, an
experimental mid-1970s training program that provided 
work experience to men and women with weak labor 
force attachment. 

STAR The Tennessee Student/Teacher Achievement Ratio 
experiment, a randomized study of elementary school 
class size. 

WHI Women’s Health Initiative, a series of randomized 
trials that included an evaluation of hormone 
replacement therapy. 
EMPIRICAL STUDIES INDEX


This index lists studies contributing to tables and figures in the
book.


Abadie, Angrist, and Imbens (2002) Constructs QTE (IV) 
estimates of the effect of subsidized JTPA training on the 
distribution of trainee earnings. Discussed in 
section 7.2.1. Results appear in table 7.2.1. 

Acemoglu and Angrist (2000) Uses compulsory schooling 
laws and quarter of birth to construct IV estimates of the 
economic returns to schooling. Discussed in 
section 4.5.3. Results appear in table 4.4.2 and 
figure 4.5.1. 

Angrist (1990) Uses the draft lottery to construct IV 
estimates of the effect of military service on earnings. 
Discussed in sections 4.1.2 and 4.1.3. Results appear in 
tables 4.1.3 and 4.4.2. 

Angrist (1998) Estimates the effect of voluntary military 
service on civilian earnings using matching, regression, 
and IV. Discussed in section 3.3.1. Results appear in 
table 3.3.1. 

Angrist (2001) Compares OLS and IV with marginal effects 
estimates using nonlinear models. Discussed in 
section 4.6.3. Results appear in table 4.6.1. 

Angrist and Evans (1998) Uses sibling sex composition and 
twin births to construct IV estimates of the effects of 
family size on mothers’ and fathers’ labor supply. 
Discussed in sections 3.4.2 and 4.6.3. Results appear in 
tables 3.4.2, 4.4.2, and 4.6.1. 

Angrist and Imbens (1995) Shows that 2SLS estimates can 
be interpreted as the weighted average causal response to 
treatment. Discussed in section 4.5.3. Results appear in
table 4.1.2. 

Angrist and Krueger (1991) Uses quarter of birth to 
construct IV estimates of the economic returns to 
schooling. Discussed in sections 4.1, 4.5.3, and 4.6.4. 
Results appear in figure 4.1.1 and tables 4.1.1, 4.1.2, 
4.4.2, and 4.6.2. 

Angrist and Lavy (1999) Uses a fuzzy RD to estimate the 
effects of class size on student achievement. Discussed in 
section 6.2. Results appear in figure 6.2.1 and 
table 6.2.1. 

Angrist, Chernozhukov and Fernandez-Val (2006) Shows 
that quantile regression generates a MMSE 
approximation to a nonlinear CQF, and illustrates the 
quantile regression approximation property by 
estimating the effects of schooling on the distribution of 
wages. Discussed in section 7.1.2. Results appear in 
table 7.1.1 and figure 7.1.1. 

Autor (2003) Uses state variation in employment 
protection laws to construct DD estimates of the effect of 
labor market regulation on temporary employment. 
Discussed in section 5.2.1. Results appear in figure 5.2.4. 

Besley and Burgess (2004) Use state variation to estimate 
the effect of labor laws on firm performance in India. 
Discussed in section 5.2.1. Results appear in table 5.2.3. 

Bloom et al. (1997) Reports the JTPA main findings.
Discussed in section 4.4.3. Results appear in table 4.4.1. 

Card (1992) Uses state minimum wages and regional 
variation in wage levels to estimate the effect of the 
minimum wage. Discussed in section 5.2.1. Results 
appear in table 5.2.2. 

Card and Krueger (1994, 2000) Use a New Jersey 
minimum wage increase to estimate the employment 
effects of a minimum wage change. Discussed in 
section 5.2. Results appear in table 5.2.1 and figure 5.2.2. 

Dehejia and Wahba (1999) Uses the propensity score to 
estimate the effects of subsidized training on earnings in 
a reanalysis of the Lalonde (1986) NSW sample.
Discussed in section 3.3.3. Results appear in table 3.3.2. 

Freeman (1984) Uses fixed effects models to construct 
panel-data estimates of the effect of union status on 
wages. Discussed in section 5.1. Results appear in 
table 5.1.1. 

Krueger (1999) Uses the Tennessee STAR randomized trial 
to construct IV estimates of the effect of class size on test 
scores. Discussed in section 2.2. Results appear in 
tables 2.2.1, 2.2.2, and 8.2.1. 

Lee (2008) Uses a regression discontinuity design to 
estimate the effect of party incumbency on reelection. 
Discussed in section 6.1. Results appear in figure 6.1.2. 

Manning et al. (1987) Uses randomized assignment to 
estimate the impact of health insurance plans on health 
care use, cost, and outcomes. Discussed in section 3.4.2. 
Results appear in table 3.4.1. 

Pischke (2007) Uses a sharp change in the length of the 
German school year to estimate the effect of school term 
length on achievement. Discussed in section 5.2. Results 
appear in figure 5.2.3. 